I'm trying to make a simple regression using XGBoost, but unfortunately it keeps outputting a constant value for every X value I feed it.
The code (a dummy version of my problem) is the following:
import numpy as np
import xgboost as xgb
import matplotlib.pyplot as plt

x = np.array(np.arange(0, 10, 0.2), ndmin=2).T
y = np.sin(x**2)

n_week = 40   # length of training data
n_pred = 5    # number of predictions
len_pred = 2  # spacing between predicted points
n_est = 50
depth = 6
eta = 0.3

xgb_model = xgb.XGBRegressor(objective='reg:squarederror', random_state=17,
                             n_estimators=n_est, max_depth=depth, eta=eta)
xgb_model.fit(x[:n_week], y[:n_week])

pred = xgb_model.predict(x[n_week:][[i * len_pred for i in range(n_pred)]])

plt.plot(range(len(x)), x)
plt.plot([*range(n_week), *[n_week - 1 + i for i in [(i + 1) * len_pred for i in range(n_pred)]]],
         [*y[:n_week], *pred])
plt.plot(range(len(x)), y, color='green', linestyle='dashed')
plt.show()
The output is the following:
So we have X in blue, Y in dashed green, and in orange Y up to where the predictions start, followed by the predictions, which drop to -0.89199615 and stay there. I would expect them to at least "move". My actual problem is even easier, as the input is highly correlated with the output...
My prediction output is : [-0.89199615 -0.89199615 -0.89199615 -0.89199615 -0.89199615]
Does anyone have any idea what's going on? I did this a few weeks ago and it worked perfectly, and now I get this very weird output. I tried specifying many different settings in my XGBRegressor to see if that was the cause, but no luck.
Thanks a lot
EDIT:
Guess I figured it out and I was just missing the obvious: since the model is tree-based and it's seeing inputs outside the range it was trained on (especially on 1D data?), it can't output values it hasn't learned, i.e. it can't extrapolate? And thus it's not a suitable way to predict what I'm trying to predict? I guess I should stick with regression, but unfortunately that isn't giving me a satisfactory answer either.
If someone can confirm, it'll make me feel better :)
Thanks
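A minimal sketch of the behaviour in question (separate from the code above, using a simple linear target purely for illustration): a tree ensemble can only return combinations of the target values it saw during training, so outside the training range of x it predicts a near-constant value, whereas a linear model keeps extrapolating.

import numpy as np
import xgboost as xgb
from sklearn.linear_model import LinearRegression

# Train on x in [0, 8), then predict well outside that range.
x_train = np.arange(0, 8, 0.2).reshape(-1, 1)
y_train = 2 * x_train.ravel() + 1          # simple linear relation
x_new = np.array([[9.0], [10.0], [11.0]])  # outside the training range

tree = xgb.XGBRegressor(objective='reg:squarederror').fit(x_train, y_train)
line = LinearRegression().fit(x_train, y_train)

print(tree.predict(x_new))  # roughly constant, near the largest y seen in training
print(line.predict(x_new))  # keeps increasing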
Good evening,
I'm currently pursuing a PhD in chemistry, and in that framework I'm trying to apply my limited knowledge of Python and statistics to discriminate samples based on their IR spectra.
After a few weeks of data acquisition I'm finally able to build my data set, and I was about to see what PCA has to offer (this was the easy part).
I was able to build my script and get the loadings, scores, and everything else I could possibly need or want. However, I used the StandardScaler from sklearn.preprocessing to scale down my data, so (correct me if I'm wrong) I should get back loadings in this "standard scaled" space.
As my data are actual IR spectra, those loadings have a chemical meaning (even though they are not real spectra), e.g. if my PC1 loadings have a peak at XX cm-1, I know that samples with a high PC1 score are likely to contain compounds that absorb at this wavenumber.
So I want to reverse the StandardScaler transformation. I've tried using StandardScaler.inverse_transform(), however it appears to return the same array that I gave it... which is very frustrating...
I tried to do the same thing with my sample spectra, but it gave me the same result again. Here is the portion of my script where I tried this:
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

Wavenumbers = DFF.columns
# in fact this is a little more complicated, but that's the spirit
Spectre = DFF.values.tolist()
# btw DFF is my pandas DataFrame containing the spectra, with features = wavenumbers

SS = StandardScaler(copy=True)
DFF = SS.fit_transform(DFF)  # at this point I use SS for preprocessing before PCA

# I'm then trying to invert SS and get back the first spectrum of the dataset
D = SS.inverse_transform(DFF[0])

# However at this point DFF[0] and D are almost exactly the same; I'm sure because:
plt.plot(Wavenumbers, D)
plt.plot(Wavenumbers, DFF[0])  # the curves are the same, and:
for i, j in enumerate(D):
    if j == DFF[0][i]:
        pass
    else:
        print("{}".format(j - DFF[0][i]))  # prints nothing bigger than 10e-16
The problem is more than likely my syntax or how I used StandardScaler; however, I have no one around me to ask for help with this. Can anyone tell me what I did wrong, or give me a hint on how I could get my loadings back in the "actual real IR spectra" space?
PS: sorry for the wacky English, I hope I'm understandable.
Good evening,
After putting the problem aside for a few days I finally re-coded the function I needed (as suggested by Robert Dodier).
As a reminder, I wanted a function that could take my data from a pandas DataFrame and mean-center it in order to do PCA, but that could also reverse the preprocessing for later use.
Here is the code I ended up with:
import pandas as pd
import numpy as np

class Scaler:
    """Column-wise standardizer for a pandas DataFrame, with an inverse."""

    def __init__(self):
        self.std = []
        self.mean = []

    def fit(self, DF):
        # Store the per-column mean and standard deviation.
        self.std = []
        self.mean = []
        for c in DF.columns:
            self.std.append(DF[c].std())
            self.mean.append(DF[c].mean())

    def transform(self, DF):
        # Standardize: subtract the column mean and divide by the column std.
        X = np.zeros(shape=DF.shape)
        for i, c in enumerate(DF.columns):
            for j in range(len(DF.index)):
                X[j][i] = (DF[c][j] - self.mean[i]) / self.std[i]
        return X

    def reverse(self, X):
        # Undo the standardization: multiply by the std and add back the mean.
        Y = np.zeros(shape=X.shape)
        for i in range(len(X[0])):
            for j in range(len(X)):
                Y[j][i] = X[j][i] * self.std[i] + self.mean[i]
        return Y

    def fit_transform(self, DF):
        self.fit(DF)
        return self.transform(DF)
It's pretty slow and surely very low-tech, but it seems to do the job just fine. Hope it saves some time for other Python beginners.
I designed it to behave as closely as I could to how sklearn.preprocessing.StandardScaler does it.
Example:
S = Scaler()         # create the scaler object
S.fit(DF)            # fit the scaler to the DataFrame (compute the mean and std of every column; DF must be a pd.DataFrame)
X = S.transform(DF)  # returns a np.array with standardized data
Y = S.reverse(X)     # reverses the transformation to get back the original data
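A quick way to verify the round trip (a minimal sketch with made-up data; column names are just placeholders) is to check that reverse() reproduces the original values. Note that sklearn's StandardScaler uses the population standard deviation (ddof=0) whereas pandas' .std() uses ddof=1, so the two scalers differ by that small factor.

import numpy as np
import pandas as pd

# Hypothetical spectra: 20 samples, 5 wavenumber features.
DF = pd.DataFrame(np.random.rand(20, 5),
                  columns=[f"{1000 + 10 * k} cm-1" for k in range(5)])

S = Scaler()
X = S.fit_transform(DF)  # standardized data (mean 0, unit variance per column)
Y = S.reverse(X)         # back-transformed data

print(np.allclose(Y, DF.values))  # expected: True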
Again, sorry for the hastily typed English, and thanks to Robert for taking the time to answer.
When I ran StellarGraph's demo on graph classification using DGCNNs, I got the same result as in the demo.
However, when I tested what happens when I first shuffle the data using the following code:
import random

# Shuffle the graphs and their labels together, then unzip them again.
shuffler = list(zip(graphs, graph_labels))
random.shuffle(shuffler)
graphs, graph_labels = zip(*shuffler)
The model didn't learn at all (accuracy of around 50%, i.e. just the label distribution).
Does anyone know why this happens? Maybe I shuffled the wrong way? Or should the data not be shuffled in the first place (and if so, why? It doesn't make sense to me)? Or is it a bug in StellarGraph's implementation?
I found the problem. It had nothing to do with the shuffling, nor with StellarGraph's implementation. The problem was in the demo, at the following lines:
train_gen = gen.flow(
    list(train_graphs.index - 1),
    targets=train_graphs.values,
    batch_size=50,
    symmetric_normalization=False,
)
test_gen = gen.flow(
    list(test_graphs.index - 1),
    targets=test_graphs.values,
    batch_size=1,
    symmetric_normalization=False,
)
The problem was caused specifically by train_graphs.index - 1 and test_graphs.index - 1. The indices are already in the range 0 to n, so subtracting one from them shifts the graph data back by one, causing each data point to get the label of a different data point.
To fix this, simply change them to train_graphs.index and test_graphs.index, without the -1 at the end.
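With that change, the generators from the demo look like this (everything else unchanged):

train_gen = gen.flow(
    list(train_graphs.index),
    targets=train_graphs.values,
    batch_size=50,
    symmetric_normalization=False,
)
test_gen = gen.flow(
    list(test_graphs.index),
    targets=test_graphs.values,
    batch_size=1,
    symmetric_normalization=False,
)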
I am looking for a solution to the following problem. I am hoping to use a regression model to predict some values. Currently I can use it to predict a single value, i.e. the model will predict the y-value for an x-value equal to 10, for example. What I would like to do is use an array as the input, so the model produces a new array of predicted values. Is this possible? I have attached some code in the hope that it helps.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

poly_reg = PolynomialFeatures(degree=3)
wspoly = poly_reg.fit_transform(ws)
lin_reg2 = LinearRegression()
lin_reg2.fit(wspoly, elec)

plt.figure(figsize=(25, 15))
x_grid = np.arange(min(ws), max(ws), 0.1)
x_grid = x_grid.reshape(len(x_grid), 1)
plt.scatter(ws, elec, color='red')
plt.plot(x_grid, lin_reg2.predict(poly_reg.fit_transform(x_grid)), color='blue')
plt.title("Polynomial Regression of Wind Speed & Electricity Generated")
plt.xlabel('Wind Speed (m/s)')
plt.ylabel('Electricity Generated (kWh)')
plt.show()
Output from the above code showing the polynomial regression model
prediction = lin_reg2.predict(poly_reg.fit_transform([[10]]))
Start out by creating a function to return the prediction. The function will look like this.
def my_func(x_value):
    # your code goes here
    return y_value
Then you can create an array of predictions like this.
y_values = [my_func(x) for x in list_of_x_values]
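For the polynomial model above, a minimal sketch of what that function might look like (assuming poly_reg and lin_reg2 have already been fitted as in the question; the example x values are made up):

def my_func(x_value):
    # Turn the single value into the polynomial feature row the model expects,
    # then return the scalar prediction.
    features = poly_reg.fit_transform([[x_value]])
    return lin_reg2.predict(features)[0]

# Predictions for a whole array of wind speeds.
list_of_x_values = [10, 12.5, 15]
y_values = [my_func(x) for x in list_of_x_values]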
I'm trying to deliver SHAP decision plots for a small subset of predictions, but the outputs found by shap are different from what I get when just using the model to predict, even with link='logit' in the call. Because of the subset I'm plotting, the result of every decision plot should be greater than the expected value. However, every single plot produced has a predicted value lower than the expected value.
I have two models combined in a minimum ensemble, so I'm using a for loop to determine which model to produce a plot for. I have no issue creating the correct plots for the RandomForestClassifier model, but the issue occurs for the XGB model.
import shap
import matplotlib.pyplot as plt

rf_explainer = shap.TreeExplainer(RF_model)
xgb_explainer = shap.TreeExplainer(XGB_model)

for i in range(flagged.shape[0]):
    if flagged_preds.RF_Score[i] == flagged_preds.Ensemble_Score[i]:
        idx = flagged.index[i]
        idxstr = idx[1].astype('str') + ' -- ' + idx[2].date().strftime('%Y-%m-%d') + ' -- ' + idx[0].astype('str')
        shap_value = rf_explainer.shap_values(flagged.iloc[i, :])
        shap.decision_plot(rf_explainer.expected_value[1], shap_value[1], show=False)
        plt.savefig(f'//PathToFolder/{idxstr} -- RF.jpg', format='jpg', bbox_inches='tight', facecolor='white')

    if flagged_preds.XGB_Score[i] == flagged_preds.Ensemble_Score[i]:
        idx = flagged.index[i]
        idxstr = idx[1].astype('str') + ' -- ' + idx[2].date().strftime('%Y-%m-%d') + ' -- ' + idx[0].astype('str')
        shap_value = xgb_explainer.shap_values(flagged.iloc[i, :])
        shap.decision_plot(xgb_explainer.expected_value, shap_value, link='logit', show=False)
        plt.savefig(f'//PathToFolder/{idxstr} -- XGB.jpg', format='jpg', bbox_inches='tight', facecolor='white')

    plt.close()
As mentioned before, when scoring, each observation (of the ones I'm concerned with) should have a score > 0.5, but that isn't what I see in my SHAP plots. Here is an example:
This plot shows an output of about 0.1, but when scoring this observation using predict_proba, I get a value of 0.608.
I can't really provide a reprex due to the sensitive nature of the data and I'm not sure what the underlying issue is.
Any feedback would be very welcome. Thank you.
Relevant pip freeze items:
Python 3.7.3
matplotlib==3.0.3
shap==0.30.1
xgboost==0.90
I suggest making a direct comparison between your model output and your SHAP output.
The second code example in Section "Changing the SHAP base value" in the SHAP Decision Plots documentation shows how to sum SHAP values to match the model output for a LightGBM model. You can use the same approach for any other model. If the summed SHAP values don't match the model output, it's not a plotting issue. The code example is copied below. Both lines should print the same value.
# The model's raw prediction for the first observation.
print(model.predict(features.iloc[[0]].values, raw_score=True)[0].round(4))
# The corresponding sum of the mean + shap values
print((expected_value + shap_values[0].sum()).round(4))
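For the XGB model in the question, an equivalent check might look like this (a sketch using the variable names from the question; for a binary XGBoost classifier the raw margin is in log-odds, which is what the SHAP values sum to before the logit link is applied):

import xgboost
import shap

# Compare the raw XGBoost margin (log-odds) with the SHAP sum for one
# of the flagged observations.
xgb_explainer = shap.TreeExplainer(XGB_model)
row = flagged.iloc[[0], :]

raw_margin = XGB_model.get_booster().predict(
    xgboost.DMatrix(row), output_margin=True
)[0]
shap_sum = xgb_explainer.expected_value + xgb_explainer.shap_values(row)[0].sum()

print(round(float(raw_margin), 4))  # both should print the same value
print(round(float(shap_sum), 4))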
I apologize in advance for the simplicity of this question. I have no background in stats and am getting lost in the complexity of it all.
If I have a couple thousand numbers, each with a binary outcome:
number,outcome
14,0
27,1
88,1
04,0
42,1
How do I predict the outcome for future numbers? For example:
82
45
02
Or is this going to be inaccurate due to there being only a single variable? All the examples I have seen use multiple variables.
I have been digging through statsmodels and went through this great tutorial: http://blog.yhathq.com/posts/logistic-regression-and-python.html. And through that I've made this:
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("binary.csv")
df.columns = ["number", "outcome"]
data = df[['number', 'outcome']]

train_cols = data.columns[0]  # the single predictor column, "number"
logit = sm.Logit(data['outcome'], data[train_cols])
result = logit.fit()
print(result.summary())
But that seems to be analyzing the weight of the current numbers; how would you predict new ones? Am I even going about this the right way?
The result of the fit should have a method predict(). That is what you need to use to predict future values, for example:
result = sm.Logit(outcomes, values).fit()
result.predict([82,45,2])
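A slightly fuller sketch of the whole workflow, assuming binary.csv is laid out as in the question; note that it adds an intercept with sm.add_constant, which Logit does not add by default and which the snippet in the question omits:

import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("binary.csv")
df.columns = ["number", "outcome"]

# Add an explicit intercept column before fitting.
X = sm.add_constant(df[["number"]])
result = sm.Logit(df["outcome"], X).fit()

# Predicted probabilities of outcome == 1 for new numbers.
new_X = sm.add_constant(pd.DataFrame({"number": [82, 45, 2]}))
print(result.predict(new_X))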