Python OptBinning package's OptimalBinning and BinningProcess sometimes giving different results

I'm using the OptBinning package to bin some numeric data. I'm following this example, and from this tutorial I read that "... the best way to view BinningProcess is as a wrapper for OptimalBinning", which implies that both should give the same output. However, I'm seeing that they give different outputs for some features and identical outputs for others. Why is this the case? Below is an example, using the breast cancer dataset from sklearn, in which the two methods agree for 'mean radius' but not for 'worst radius'.
import pandas as pd
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from optbinning import BinningProcess
from optbinning import OptimalBinning
# Load data
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
# Bin 'mean radius' data using OptimalBinning method
var = 'mean radius'
x = df[var]
y = data.target
optb = OptimalBinning(name=var, dtype="numerical")
optb.fit(x, y)
binning_table = optb.binning_table
binning_table.build()['WoE']
0 -3.12517
1 -2.71097
2 -1.64381
3 -0.839827
4 -0.153979
5 2.00275
6 5.28332
7 0
8 0
Totals
Name: WoE, dtype: object
# Bin 'mean radius' using BinningProcess method
var = ['mean radius']
bc_pipe = Pipeline([('WOE Binning', BinningProcess(variable_names=var))])
preprocessor = ColumnTransformer([('Numeric Pipeline', bc_pipe, var)], remainder='passthrough')
preprocessor.fit(df, y)
df_processed = preprocessor.transform(df)
df_processed = pd.DataFrame(df_processed, columns=df.columns)
df_processed[var[0]].unique()
array([ 5.28332344, -3.12517033, -1.64381421, -0.15397917, 2.00275405,
-0.83982705, -2.71097154])
## We see that the Weight of Evidence (WoE) values for 'mean radius' are the same with both methods (the two 0 rows are the Special and Missing bins, which are empty here and can be ignored)
# Bin 'worst radius' using OptimalBinning process
var = 'worst radius'
x = df[var]
y = data.target
optb = OptimalBinning(name=var, dtype="numerical")
optb.fit(x, y)
binning_table = optb.binning_table
binning_table.build()['WoE']
0 -4.56645
1 -2.6569
2 -0.800606
3 -0.060772
4 1.61976
5 5.5251
6 0
7 0
Totals
Name: WoE, dtype: object
# Bin 'worst radius' using BinningProcess method
var = ['worst radius']
bc_pipe = Pipeline([('WOE Binning', BinningProcess(variable_names=var))])
preprocessor = ColumnTransformer([('Numeric Pipeline', bc_pipe, var)], remainder='passthrough')
preprocessor.fit(df, y)
df_processed = preprocessor.transform(df)
df_processed = pd.DataFrame(df_processed, columns=df.columns)
df_processed[var[0]].unique()
array([0.006193 , 0.003532 , 0.004571 , 0.009208 , 0.005115 , 0.005082 ,
0.002179 , 0.005412 , 0.003749 , 0.01008 , 0.003042 , 0.004144 ,
0.01284 , 0.003002 , 0.008093 , 0.005466 , 0.002085 , 0.004142 ,
0.001997 , 0.0023 , 0.002425 , 0.002968 , 0.004394 , 0.001987 ,
0.002801 , 0.007444 , 0.003711 , 0.004217 , 0.002967 , 0.003742 ,
0.00456 , 0.005667 , 0.003854 , 0.003896 , 0.003817 , ... ])
## For 'worst radius', however, the two sets of WoE values do not match. Why?

The problem is due to the behaviour of the ColumnTransformer option remainder="passthrough": the transformed columns come first in the output and the remaining columns are concatenated after them, so the positions of the variables change and reassigning df.columns mislabels the result. If you look at the dataframe, the first column actually contains the WoE values of the feature "worst radius". ('mean radius' only appeared to agree because it happens to be the first column of df, so the reordering did not move it.) As an example, please try the following:
binning_process = BinningProcess(variable_names=var)
binning_process.fit(df[var], y)
np.unique(binning_process.transform(df[var]).values)
The binning process, as expected, will return the same WoE values. See also: https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html
By default, only the specified columns in transformers are transformed and combined in the output, and the non-specified columns are dropped. (default of 'drop'). By specifying remainder='passthrough', all remaining columns that were not specified in transformers will be automatically passed through. This subset of columns is concatenated with the output of the transformers.
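For completeness, here is a sketch of how to keep the labels straight if you do want the ColumnTransformer with remainder='passthrough'. Its output puts the transformed columns first and appends the untouched columns after them, so build the column list in that order instead of reusing df.columns (the reordering below is my reading of that documented behaviour, not part of the original answer):
var = ['worst radius']
preprocessor = ColumnTransformer([('Numeric Pipeline', BinningProcess(variable_names=var), var)],
                                 remainder='passthrough')
preprocessor.fit(df, y)
# Transformed columns come first, then the remaining ones in their original order
ordered_cols = var + [c for c in df.columns if c not in var]
df_processed = pd.DataFrame(preprocessor.transform(df), columns=ordered_cols)
df_processed['worst radius'].unique()  # now matches OptimalBinning's WoE values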

Related

How to feed a nested array into an SVM model

My question is the following:
I have an array of feature vectors corresponding to several audio files, so if for example there are 10 audio files, this array has length 10.
One of the features is itself a list (this list holds the information of a specific feature of the audio file), and for a given audio file the feature vector looks like this:
array([0.03861840871664194, 187.72393405210002, 62.59881268743305,
0.2911392405063291,
array([4963.40332031, 3229.98046875, 2691.65039062, 3208.44726562,
4338.94042969, 4220.5078125 , 4166.67480469, 4801.90429688,
5555.56640625, 5910.86425781, 6115.4296875 , 5706.29882812,
4984.93652344, 2756.25 , 1991.82128906, 2551.68457031,
2734.71679688, 2906.98242188, 3143.84765625, 3219.21386719,
3186.9140625 , 3165.38085938, 3068.48144531, 2465.55175781,
2110.25390625, 2508.61816406, 2993.11523438, 3843.67675781,
4715.77148438, 5652.46582031, 5480.20019531, 5792.43164062,
5932.39746094, 6244.62890625, 6072.36328125, 6201.5625 ,
6158.49609375, 6201.5625 , 6233.86230469, 6061.59667969])],
dtype=object)
Now when I try to feed this data into the svm model:
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
X_train, X_val, y_train, y_val = train_test_split(X,y,test_size=0.3)
model = svm.SVC()
model.fit(X_train,y_train)
yt_p = model.predict(X_train)
yv_p = model.predict(X_val)
I get this error: ValueError: setting an array element with a sequence.
How can I structure my feature vector so that I can feed it to the SVM?
EDIT:
Here is an example of X. If we have 5 audio files, then X will be:
array([[0.017455393927437918, 227.66237105624407, 32.42076654734572,
0.3867924528301887,
array([1851.85546875, 2433.25195312, 3057.71484375, 3079.24804688,
3079.24804688, 3068.48144531, 3046.94824219, 3359.1796875 ,
3908.27636719, 4618.87207031, 4618.87207031, 4521.97265625,
4091.30859375, 3111.54785156, 3100.78125 , 2863.91601562,
1561.15722656, 1119.7265625 , 1065.89355469, 947.4609375 ,
979.76074219, 990.52734375, 990.52734375, 1356.59179688,
2077.95410156, 2993.11523438, 3025.41503906, 3068.48144531,
3079.24804688, 3090.01464844, 3100.78125 , 3111.54785156,
2993.11523438, 3100.78125 , 3079.24804688, 2853.14941406,
1205.859375 , 1281.22558594, 1614.99023438, 2131.78710938,
2325.5859375 , 2034.88769531, 1916.45507812, 1744.18945312,
1851.85546875, 2357.88574219, 2368.65234375, 1916.45507812,
1959.52148438, 1959.52148438, 1754.95605469, 1787.25585938,
2207.15332031])],
[0.03861840871664194, 187.72393405210002, 62.59881268743305,
0.2911392405063291,
array([4963.40332031, 3229.98046875, 2691.65039062, 3208.44726562,
4338.94042969, 4220.5078125 , 4166.67480469, 4801.90429688,
5555.56640625, 5910.86425781, 6115.4296875 , 5706.29882812,
4984.93652344, 2756.25 , 1991.82128906, 2551.68457031,
2734.71679688, 2906.98242188, 3143.84765625, 3219.21386719,
3186.9140625 , 3165.38085938, 3068.48144531, 2465.55175781,
2110.25390625, 2508.61816406, 2993.11523438, 3843.67675781,
4715.77148438, 5652.46582031, 5480.20019531, 5792.43164062,
5932.39746094, 6244.62890625, 6072.36328125, 6201.5625 ,
6158.49609375, 6201.5625 , 6233.86230469, 6061.59667969])],
[0.042435441297643324, 128.81225073038124, 20.912528554426807,
0.313953488372093,
array([4349.70703125, 4242.04101562, 4274.34082031, 4123.60839844,
4457.37304688, 4834.20410156, 4661.93847656, 4306.640625 ,
4231.27441406, 4543.50585938, 4435.83984375, 6201.5625 ,
8817.84667969, 8817.84667969, 742.89550781, 721.36230469,
732.12890625, 732.12890625, 710.59570312, 721.36230469,
925.92773438, 1119.7265625 , 1141.25976562, 1431.95800781,
7762.71972656, 7934.98535156, 7891.91894531, 7332.05566406,
3789.84375 , 2799.31640625, 2831.61621094, 2217.91992188,
581.39648438, 602.9296875 , 2217.91992188, 2228.68652344,
2368.65234375, 2519.38476562, 2863.91601562, 3682.17773438,
3649.87792969, 4188.20800781, 4112.84179688])],
[0.006295381642571726, 130.28309914454434, 5.193614287487564,
0.2411764705882353,
array([7978.05175781, 8010.3515625 , 8118.01757812, 8430.24902344,
8257.98339844, 8451.78222656, 8591.74804688, 8677.88085938,
8796.31347656, 8850.14648438, 8796.31347656, 8925.51269531,
6244.62890625, 344.53125 , 344.53125 , 1614.99023438,
2325.5859375 , 2971.58203125, 3316.11328125, 3617.578125 ,
3294.58007812, 2788.54980469, 2637.81738281, 2702.41699219,
2723.95019531, 3133.08105469, 3413.01269531, 5663.23242188,
5770.8984375 , 5577.09960938, 2228.68652344, 1604.22363281,
1690.35644531, 4123.60839844, 5566.33300781, 5803.19824219,
5749.36523438, 5846.26464844, 6772.19238281, 7073.65722656,
7622.75390625, 7859.61914062, 8236.45019531, 8441.015625 ,
8699.4140625 , 8807.08007812, 8742.48046875, 8667.11425781,
8710.18066406, 8947.04589844, 9140.84472656, 9130.078125 ,
8936.27929688, 8925.51269531, 8947.04589844, 8925.51269531,
9097.77832031, 9205.44433594, 9194.67773438, 9140.84472656,
9162.37792969, 9043.9453125 , 9162.37792969, 9108.54492188,
9183.91113281, 9280.81054688, 9270.04394531, 9108.54492188,
9076.24511719, 9356.17675781, 9226.97753906, 9216.2109375 ,
9248.51074219, 9140.84472656, 9237.74414062, 9334.64355469,
9259.27734375, 9226.97753906, 9216.2109375 , 9108.54492188,
9183.91113281, 9216.2109375 , 9248.51074219, 9259.27734375,
9183.91113281])],
[0.017070271599460656, 171.91660927761163, 26.854424936811768,
0.11188811188811189,
array([4715.77148438, 4629.63867188, 4898.80371094, 5275.63476562,
4941.87011719, 4532.73925781, 4618.87207031, 4995.703125 ,
4705.00488281, 4500.43945312, 4188.20800781, 4371.24023438,
4457.37304688, 4188.20800781, 4909.5703125 , 4877.27050781,
6761.42578125, 7708.88671875, 7719.65332031, 7956.51855469,
8484.08203125, 9033.17871094, 9043.9453125 , 9000.87890625,
9011.64550781, 9011.64550781, 9000.87890625, 9108.54492188,
8817.84667969, 6686.05957031, 1808.7890625 , 1830.32226562,
1851.85546875, 1636.5234375 , 1022.82714844, 1281.22558594,
1927.22167969, 1948.75488281, 1302.75878906, 1399.65820312,
1873.38867188, 1959.52148438, 7245.92285156, 9011.64550781,
9420.77636719, 9549.97558594, 9453.07617188, 9431.54296875,
9410.00976562, 9248.51074219, 9151.61132812, 9194.67773438,
8968.57910156, 8634.81445312, 8268.75 , 7439.72167969,
5501.73339844, 5232.56835938, 5103.36914062, 7052.12402344,
7299.75585938, 7127.49023438, 7192.08984375, 5673.99902344,
5523.26660156, 5986.23046875, 6729.12597656, 6309.22851562,
5135.66894531, 5081.8359375 , 5329.46777344, 5404.83398438])]],
dtype=object)
You can feed the feature with the lists inside to your model in two ways:
1. Treat the list elements as additional features.
2. Map all of its elements into a single number with a function you deem appropriate (min, median, mean, max, sum, etc.).
To try the first option:
# Convert `X` to data frame
X = pd.DataFrame(X)
# Rename columns
X.columns = ['feature_' + str(i + 1) for i in range(X.shape[1])]
# Convert the feature with lists inside to long format
x = X['feature_5'].explode().to_frame()
# Create counter by observation so we can pivot
x['observation_id'] = x.groupby(level=0).cumcount()
# Pivot to wide format: one column per list element, padding shorter lists with 0
x = x.pivot(columns='observation_id', values='feature_5').fillna(0)
x = x.add_prefix('list_element_')
# Drop the original list column `feature_5` from X
X.drop(columns='feature_5', inplace=True)
# Concatenate X and x together
X = pd.concat([X, x], axis=1)
# Carry on as before
X_train, X_val, y_train, y_val = train_test_split(X,y,test_size=0.3)
model = svm.SVC()
model.fit(X_train,y_train)
There's no single right answer for the second option; only you can decide how to do this, because only you know what the lists mean. However, if you want to use the mean of each list (for example) as the feature:
# Get the mean of each list
means = [np.mean(array) for array in X[:, 4]]
# Replace the lists with `means`
X[:, 4] = means
And then carry on with the splitting and fitting.
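Putting the second option together end to end (a minimal sketch, assuming X and y are the arrays from the question; note the astype call, since X keeps dtype=object after the replacement and SVC will reject it):
import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split
X[:, 4] = [np.mean(arr) for arr in X[:, 4]]  # collapse each embedded list to its mean
X = X.astype(float)  # cast away the object dtype so scikit-learn accepts the matrix
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3)
model = svm.SVC()
model.fit(X_train, y_train)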

ValueError: x and y must be the same size In Python while creating KMeans Model

I'm building a KMeans clustering model with a churn dataset and am getting ValueError: x and y must be the same size when trying to create the cluster graph.
I'll post both my function and the graph code below, but in trying to narrow it down, I think it may have something to do with these lines in the function:
x=kmeans.cluster_centers_[:,0]
, y=kmeans.cluster_centers_[:,1]
Here's the full code
def Create_kmeans_cluster_graph(df_final, data, n_clusters, x_title, y_title, chart_title):
""" Display K-means cluster based on data """
kmeans = KMeans(n_clusters=n_clusters # No of cluster in data
, random_state = random_state # Selecting same training data
)
kmeans.fit(data)
kmean_colors = [plotColor[c] for c in kmeans.labels_]
fig = plt.figure(figsize=(12,8))
plt.scatter(x= x_title + '_norm'
, y= y_title + '_norm'
, data=data
, color=kmean_colors # color of data points
, alpha=0.25 # transparancy of data points
)
plt.xlabel(x_title)
plt.ylabel(y_title)
plt.scatter(x=kmeans.cluster_centers_[:,0]
, y=kmeans.cluster_centers_[:,1]
, color='black'
, marker='X' # Marker sign for data points
, s=100 # marker size
)
plt.title(chart_title,fontsize=15)
plt.show()
return kmeans.fit_predict(df_final[df_final.Churn==1][[x_title+'_norm', y_title +'_norm']])
# Graph
df_final['Cluster'] = -1 # by default set Cluster to -1
df_final.iloc[(df_final.Churn==1),'Cluster'] = Create_kmeans_cluster_graph(df_final
,df_final[df_final.Churn==1][['Tenure_norm','MonthlyCharge_norm']]
,3
,'Tenure'
,'MonthlyCharges'
,"Tenure vs Monthlycharges : Churn customer cluster")
df_final['Cluster'].unique()
You get that error because of this line:
plt.scatter(x= x_title + '_norm'
, y= y_title + '_norm'
, data=data
, color=kmean_colors # color of data points
, alpha=0.25 # transparancy of data points
)
plt.scatter does accept a data= keyword, but the x and y strings are only substituted when they exactly match column names in data (see the help page). Here y_title is 'MonthlyCharges', so y becomes 'MonthlyCharges_norm', which is not a column of data (the column is named 'MonthlyCharge_norm'); the string is then passed through literally, so its size cannot match x, and you get the error. You can either index the dataframe explicitly, with the correct column names:
plt.scatter(data[x_title + '_norm'],data[y_title + '_norm'],...)
Or use the plot.scatter method on a pandas dataframe, which I did in an edited version of your function (note the example call below also passes 'MonthlyCharge' as y_title so the column names match):
def Create_kmeans_cluster_graph(df_final, data, n_clusters, x_title, y_title, chart_title):
plotColor = ['k','g','b']
kmeans = KMeans(n_clusters=n_clusters , random_state = random_state)
kmeans.fit(data)
kmean_colors = [plotColor[c] for c in kmeans.labels_]
data.plot.scatter(x= x_title + '_norm', y= y_title + '_norm',
color=kmean_colors,alpha=0.25)
plt.xlabel(x_title)
plt.ylabel(y_title)
plt.scatter(x=kmeans.cluster_centers_[:,0],y=kmeans.cluster_centers_[:,1],
color='black',marker='X',s=100)
return kmeans.labels_
On an example dataset, it works:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
random_state = 42
np.random.seed(42)
df_final = pd.DataFrame({'Tenure_norm':np.random.uniform(0,1,50),
'MonthlyCharge_norm':np.random.uniform(0,1,50),
'Churn':np.random.randint(0,3,50)})
Create_kmeans_cluster_graph(df_final
,df_final[df_final.Churn==1][['Tenure_norm','MonthlyCharge_norm']]
,3
,'Tenure'
,'MonthlyCharge'
,"Tenure vs Monthlycharges : Churn customer cluster")

MinMaxScaler inverse_transform doesn't work

My code is as follows:
# transform scale
X = dataset #(100, 18)
scaler = MinMaxScaler(feature_range=(0, 1))
scaler = scaler.fit(X)
scaled_X = scaler.transform(X)
scaled_series = Series(scaled_X[:, 17])
print(scaled_series.head())
# invert transform
inverted_X = scaler.inverse_transform(scaled_X)
inverted_series = Series(inverted_X[:, 17])
print(inverted_series.head())
The problem is that scaled_series and inverted_series are the same. How should I correct the code?
I guess the problem is specific to your dataset. For instance, when I use an example dataset, scaled_series and inverted_series give two different outputs:
Scaled Series output:
0 0.729412
1 0.741176
2 0.741176
3 0.670588
4 0.870588
dtype: float32
Inverted Series output:
0 0.698347
1 0.706612
2 0.706612
3 0.657025
4 0.797521
dtype: float32
Both scaled_series and inverted_series gave different outputs, but the values are close to each other. If you scale your data before using MinMaxScaler:
from sklearn.preprocessing import scale
X = scale(X)
Result:
Scaled Series output:
0 0.729412
1 0.741176
2 0.741176
3 0.670588
4 0.870588
dtype: float32
Inverted Series output:
0 -0.188240
1 -0.123413
2 -0.123413
3 -0.512372
4 0.589678
dtype: float32
Now, the outputs are not close to each other, they are completely different.
Code:
from sklearn.datasets import fetch_olivetti_faces
from sklearn.preprocessing import MinMaxScaler, scale
from pandas import Series
X, _ = fetch_olivetti_faces(return_X_y=True)
X = scale(X)
scaler = MinMaxScaler(feature_range=(0, 1))
scaler = scaler.fit(X)
scaled_X = scaler.transform(X)
scaled_series = Series(scaled_X[:, 17])
print("\nScaled Series output:")
print(scaled_series.head())
inverted_X = scaler.inverse_transform(scaled_X)
inverted_series = Series(inverted_X[:, 17])
print("\nInverted Series output:")
print(inverted_series.head())
You have to consider the range of your dataset X. The MinMax scaler computes
X_scaled = (X - X.min()) / (X.max() - X.min())
so if the range of X is already [0, 1], you subtract 0 and divide by 1, and the transform (and therefore its inverse) returns the same values. Min-max scaling only changes values that are not already on a 0-1 scale.
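A quick way to convince yourself of that last point (a minimal sketch; any data that already spans [0, 1] will behave the same way):
import numpy as np
from sklearn.preprocessing import MinMaxScaler
X = np.array([[0.0], [0.25], [1.0]])  # already spans [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1)).fit(X)
print(scaler.transform(X).ravel())  # [0.   0.25 1.  ] -- unchanged
print(scaler.inverse_transform(scaler.transform(X)).ravel())  # identical again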

Random Forest Classifier ValueError: Input contains NaN, infinity or a value too large for dtype('float32')

I'm trying to apply the RandomForest method to a dataset and I get this error:
ValueError: Input contains NaN, infinity or a value too large for dtype ('float32')
Could someone tell me what I can modify in the function for the code to work:
def ranks_RF(x_train, y_train, features_train, RESULT_PATH='Results'):
"""Get ranks from Random Forest"""
print("\nMétodo_Random_Forest")
random_forest = RandomForestRegressor(n_estimators=10)
np.nan_to_num(x_train)
np.nan_to_num(y_train)
random_forest.fit(x_train, y_train)
# Get rank by doing two times a sort.
imp_array = np.array(random_forest.feature_importances_)
imp_order = imp_array.argsort()
ranks = imp_order.argsort()
# Plot Random Forest
imp = pd.Series(random_forest.feature_importances_, index=x_train.columns)
imp = imp.sort_values()
imp.plot(kind="barh")
plt.xlabel("Importance")
plt.ylabel("Features")
plt.title("Feature importance using Random Forest")
# plt.show()
plt.savefig(RESULT_PATH + '/ranks_RF.png', bbox_inches='tight')
return ranks
You did not assign the result back when you replaced the NaNs (np.nan_to_num returns a new array rather than modifying in place), hence the error.
Let's try an example dataset:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_iris
iris = load_iris()
df = pd.DataFrame(data= iris['data'],
columns= iris['feature_names'] )
df['target'] = iris['target']
# insert some NAs
df = df.mask(np.random.random(df.shape) < .1)
Here is a function like yours; I removed the plotting part, because that's another question altogether:
def ranks_RF(x_train, y_train):
var_names = x_train.columns
random_forest = RandomForestRegressor(n_estimators=10)
# here you have to reassign back the values
x_train = np.nan_to_num(x_train)
y_train = np.nan_to_num(y_train)
random_forest.fit(x_train, y_train)
res = pd.DataFrame({
"features":var_names,
"importance":random_forest.feature_importances_,
})
res = res.sort_values(['importance'],ascending=False)
res['rank'] = np.arange(len(res))+1
return res
We run it:
ranks_RF(df.iloc[:,0:4],df['target'])
            features  importance  rank
3   petal width (cm)    0.601734     1
2  petal length (cm)    0.191613     2
0  sepal length (cm)    0.132212     3
1   sepal width (cm)    0.074442     4
This worked for me:
np.where(x.values >= np.finfo(np.float32).max)
where x is my pandas DataFrame. Then convert your DataFrame to float32 if it isn't already.
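A sketch of that check, assuming x is your feature DataFrame (np.where reports the row and column positions of any offending cells, including +inf):
import numpy as np
rows, cols = np.where(x.values >= np.finfo(np.float32).max)
print(rows, cols)  # positions of values too large for float32
x = x.astype(np.float32)  # cast once everything is back in range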

ValueError: Unable to coerce to Series, length must be 1: given n

I have been trying to use RF regression from scikit-learn, but I'm getting an error even with a standard model taken from the docs and tutorials. Here is the code:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
df = pd.read_excel('/home/artyom/myprojects//valuevo/field2019/report/segs_inventar_dataframe/excel_var/invcents.xlsx')
age = df[['AGE_1', 'AGE_2', 'AGE_3', 'AGE_4', 'AGE_5']]
hight = df[['HIGHT_','HIGHT_1', 'HIGHT_2', 'HIGHT_3', 'HIGHT_4', 'HIGHT_5']]
diam = df[['DIAM_', 'DIAM_1', 'DIAM_2', 'DIAM_3', 'DIAM_4', 'DIAM_5']]
za = df[['ZAPSYR_', 'ZAPSYR_1', 'ZAPSYR_2', 'ZAPSYR_3', 'ZAPSYR_4', 'ZAPSYR_5']]
tova = df[['TOVARN_', 'TOVARN_1', 'TOVARN_2', 'TOVARN_3', 'TOVARN_4', 'TOVARN_5']]
#df['average'] = df.mean(numeric_only=True, axis=1)
df['meanage'] = age.mean(numeric_only=True, axis=1)
df['meanhight'] = hight.mean(numeric_only=True, axis=1)
df['mediandiam'] = diam.mean(numeric_only=True, axis=1)
df['medianza'] = za.mean(numeric_only=True, axis=1)
df['mediantova'] = tova.mean(numeric_only=True, axis=1)
unite = df[['gapA_segA','gapP_segP', 'A_median', 'p_median', 'circ_media','fdi_median', 'pfd_median', 'p_a_median', 'gsci_media','meanhight']].dropna()
from sklearn.model_selection import train_test_split as ttsplit
df_copy = unite.copy()
trainXset = df_copy[['gapA_segA','gapP_segP', 'A_median', 'p_median', 'circ_media','fdi_median', 'pfd_median', 'p_a_median', 'gsci_media']]
trainYset = df_copy[['meanhight']]
trainXset_train, trainXset_test, trainYset_train, trainYset_test = ttsplit(trainXset, trainYset, test_size=0.3) # 70% training and 30% test
rf = RandomForestRegressor(n_estimators = 100, random_state = 40)
rf.fit(trainXset_train, trainYset_train)
predictions = rf.predict(trainXset_test)
errors = abs(predictions - trainYset_test)
mape = 100 * (errors / trainYset_test)
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')
But the output doesn't look OK:
---> 24 errors = abs(predictions - trainYset_test)
25 # Calculate mean absolute percentage error (MAPE)
26 mape = 100 * (errors / trainYset_test)
..... some more traceback
ValueError: Unable to coerce to Series, length must be 1: given 780
How can I fix it? 780 is the shape of trainYset_test. I'm not asking for a ready-made solution (i.e. write the code for me), but for advice on why this error happened; I followed the tutorials exactly.
From the error it is clear that the shapes do not line up: rf.predict returns a 1-D array of shape (780,), while trainYset_test is a DataFrame with a single column, i.e. shape (780, 1). When you subtract, pandas tries to coerce the 780-element array to that one column, hence "length must be 1: given 780". Reshape the predictions so they match:
predictions = predictions.reshape(780, 1)
I solved this by making sure the predictions had the same type and shape as the actual data. In my case the failing line was:
MSE = (sum((y_test-predictions)**2))/(len(newX)-len(newX.columns))
I resolved it by casting y_test to a numpy array:
MSE = (sum((np.array(y_test)-predictions)**2))/(len(newX)-len(newX.columns))
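To see the mismatch concretely, and a shape-safe way to compute the metric (a sketch with small fake arrays; .ravel() flattens the (n, 1) target so the subtraction broadcasts element-wise):
import numpy as np
import pandas as pd
y_test = pd.DataFrame({'meanhight': [1.0, 2.0, 3.0]})  # shape (3, 1), like trainYset_test
predictions = np.array([1.1, 1.9, 3.2])  # shape (3,), like rf.predict output
# y_test - predictions raises the error: pandas tries to coerce 3 values to 1 column
errors = np.abs(y_test.to_numpy().ravel() - predictions)
mape = 100 * errors / y_test.to_numpy().ravel()
print('Accuracy:', round(100 - mape.mean(), 2), '%.')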
