I get the error "Expected 2D array, got 1D array instead:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample."
I think my problem is inverse_transform.
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Pos.csv')
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values
print(X)
print(y)
y = y.reshape(len(y),1)
print(y)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_y = StandardScaler()
X = sc_X.fit_transform(X)
y = sc_y.fit_transform(y)
print(X)
print(y)
# Training the SVR model on the whole dataset
from sklearn.svm import SVR
regressor = SVR(kernel = 'rbf')
regressor.fit(X, y)
# Predicting a new result
sc_y.inverse_transform(regressor.predict(sc_X.transform([[6.5]])))
# Visualising the SVR results
plt.scatter(sc_X.inverse_transform(X), sc_y.inverse_transform(y), color = 'red')
plt.plot(sc_X.inverse_transform(X), sc_y.inverse_transform(regressor.predict(X)), color = 'blue')
plt.title('Truth or Bluff (SVR)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
Okay, so first of all create a sample CSV as per the link.
Then load it into a dataframe.
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
csv_file = 'G:\\MyDrive\\path\\to\\test_output.csv'
dataset = pd.read_csv(csv_file)
# the dataset
print(dataset)
At this point, inspect the data with print(dataset).
The data should look like this:
            position  level   salary
0   Business Analyst      1    45000
2  Senior Consultant      3    60000
3            Manager      4    80000
4    Country Manager      5   110000
5     Region Manager      6   150000
6            Partner      7   200000
7     Senior Partner      8   300000
8            C-level      9   500000
9                CEO     10  1000000
The code in the question then tries to create two arrays: X and y.
It is easier to build them directly from the dataframe, so X and y become:
my_x = [i+1 for i in range(len(dataset))]
my_y = dataset['salary'].values
...this produces two sequences of the same length:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
[ 45000 50000 60000 80000 110000 150000 200000 300000 500000
1000000]
This is the answer so far.
So the new question is: what do you want to achieve with the reshape line, given that there are two lists of the same length?
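The error quoted in the question most likely comes from passing the 1D output of regressor.predict straight into sc_y.inverse_transform, which expects a 2D array. The following is a minimal sketch of one possible fix, assuming the levels 1 to 10 and the salaries printed above; the variable names pred_scaled and pred are illustrative and not from the original code.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Levels and salaries taken from the two printed sequences above
X = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = np.array([45000, 50000, 60000, 80000, 110000,
              150000, 200000, 300000, 500000, 1000000],
             dtype=float).reshape(-1, 1)

sc_X, sc_y = StandardScaler(), StandardScaler()
X_scaled = sc_X.fit_transform(X)
y_scaled = sc_y.fit_transform(y)

regressor = SVR(kernel='rbf')
regressor.fit(X_scaled, y_scaled.ravel())  # SVR expects a 1D target

# predict returns a 1D array of shape (1,)
pred_scaled = regressor.predict(sc_X.transform([[6.5]]))
# reshape to a single column before inverting the target scaling
pred = sc_y.inverse_transform(pred_scaled.reshape(-1, 1))
print(pred.shape)
```

The same reshape(-1, 1) would be needed around regressor.predict(X) in the plotting line of the question.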
I have points with x and y coordinates that I want to fit a straight line to with linear regression, but I get a jagged-looking line.
I am attempting to use LinearRegression from sklearn.
To create the points, I run a for loop that randomly creates one hundred points in an array of shape 100 x 2. I slice the left column for the xs and the right column for the ys.
I expect to get a straight line when I plot m.predict.
import numpy as np
import matplotlib.pyplot as plt
import random
from sklearn.linear_model import LinearRegression
X = []
adder = 0
for z in range(100):
    r = random.random() * 20
    r2 = random.random() * 15
    X.append([r+adder-0.4, r2+adder])
    adder += 0.6
X = np.array(X)
plt.scatter(X[:,0], X[:,1], s=10)
plt.show()
m = LinearRegression()
m.fit(X[:,0].reshape(1, -1), X[:,1].reshape(1, -1))
plt.plot(m.predict(X[:,0].reshape(1, -1))[0])
I am not good with numpy, but I think the cause is the use of reshape(1, -1) to convert X[:,0] and X[:,1] from 1D to 2D. The resulting 2D array contains a single row (one sample with 100 values) instead of 100 rows with one value each, which results in an undesired regressor.
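The difference between the two reshape calls can be checked directly on a small array. This is only an illustrative sketch; the array x stands in for X[:,0] and is not the data from the question.

```python
import numpy as np

x = np.arange(100, dtype=float)  # stand-in for X[:, 0]

as_one_sample = x.reshape(1, -1)   # 1 sample with 100 features
as_one_feature = x.reshape(-1, 1)  # 100 samples with 1 feature

print(as_one_sample.shape)   # (1, 100)
print(as_one_feature.shape)  # (100, 1)
```

For a regression on a single feature, scikit-learn estimators expect the second layout for X.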
I was able to recreate this model using pandas and plot the desired result. The code is as follows:
import numpy as np
import matplotlib.pyplot as plt
import random
from sklearn.linear_model import LinearRegression
import pandas as pd
X = []
adder = 0
for z in range(100):
    r = random.random() * 20
    r2 = random.random() * 15
    X.append([r+adder-0.4, r2+adder])
    adder += 0.6
X = np.array(X)
y_train = pd.DataFrame(X[:,1],columns=['y'])
X_train = pd.DataFrame(X[:,0],columns=['X'])
# plt.scatter(X_train, y_train, s=10)
# plt.show()
m = LinearRegression()
m.fit(X_train, y_train)
plt.scatter(X_train,y_train)
plt.plot(X_train,m.predict(X_train),color='red')
My code is as follows:
# transform scale
X = dataset #(100, 18)
scaler = MinMaxScaler(feature_range=(0, 1))
scaler = scaler.fit(X)
scaled_X = scaler.transform(X)
scaled_series = Series(scaled_X[:, 17])
print(scaled_series.head())
# invert transform
inverted_X = scaler.inverse_transform(scaled_X)
inverted_series = Series(inverted_X[:, 17])
print(inverted_series.head())
The problem is that scaled_series and inverted_series give the same result. How should I correct the code?
I guess the problem is specific to your dataset. For instance, when I use an example dataset, scaled_series and inverted_series give two different outputs:
Scaled Series output:
0 0.729412
1 0.741176
2 0.741176
3 0.670588
4 0.870588
dtype: float32
Inverted Series output:
0 0.698347
1 0.706612
2 0.706612
3 0.657025
4 0.797521
dtype: float32
Both scaled_series and inverted_series gave different outputs, but the values are close to each other. If you standardize your data with scale before using MinMaxScaler:
from sklearn.preprocessing import scale
X = scale(X)
Result:
Scaled Series output:
0 0.729412
1 0.741176
2 0.741176
3 0.670588
4 0.870588
dtype: float32
Inverted Series output:
0 -0.188240
1 -0.123413
2 -0.123413
3 -0.512372
4 0.589678
dtype: float32
Now, the outputs are not close to each other, they are completely different.
Code:
from sklearn.datasets import fetch_olivetti_faces
from sklearn.preprocessing import MinMaxScaler, scale
from pandas import Series
X, _ = fetch_olivetti_faces(return_X_y=True)
X = scale(X)
scaler = MinMaxScaler(feature_range=(0, 1))
scaler = scaler.fit(X)
scaled_X = scaler.transform(X)
scaled_series = Series(scaled_X[:, 17])
print("\nScaled Series output:")
print(scaled_series.head())
inverted_X = scaler.inverse_transform(scaled_X)
inverted_series = Series(inverted_X[:, 17])
print("\nInverted Series output:")
print(inverted_series.head())
You have to consider the range of your dataset X. Consider the formula for the MinMax scaler with feature_range=(0, 1):
X_scaled = (X - X_min) / (X_max - X_min)
If the range of X is already [0, 1], scaling makes no difference, because you are subtracting 0 and dividing by 1, which returns the same value.
Min-max scaling only changes values that are not already on the 0-1 scale.
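The point about the range can be illustrated with plain NumPy. This is a sketch that assumes the standard min-max formula (X - X_min) / (X_max - X_min) for feature_range=(0, 1); the helper min_max_scale and both sample arrays are hypothetical and only serve the illustration.

```python
import numpy as np

def min_max_scale(X):
    """Column-wise (X - min) / (max - min), i.e. the feature_range=(0, 1) case."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

already_unit = np.array([[0.0], [0.25], [1.0]])   # min 0, max 1
wider_range = np.array([[10.0], [15.0], [30.0]])  # min 10, max 30

print(min_max_scale(already_unit).ravel())  # same values as the input
print(min_max_scale(wider_range).ravel())   # rescaled into [0, 1]
```

When a column already has minimum 0 and maximum 1, the scaled and the inverted series coincide, which would match the behaviour described in the question.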
I have a data frame named df:
import pandas as pd
df = pd.DataFrame({'p': [15-x for x in range(14)]
, 'x': [x for x in range(14)]})
df['y'] = 1000 * (10 / df['p'])
x is only for plotting purposes.
I'm trying to predict the y value based on the p values. I am using SVR from sklearn:
from sklearn.svm import SVR
nlm = SVR(kernel='poly').fit(df[['p']], df['y'])
df['nml'] = nlm.predict(df[['p']])
I have already tried all of the kernels, but the fit is still not accurate enough.
p x y nml
0 15 0 666.666667 524.669572
1 14 1 714.285714 713.042459
2 13 2 769.230769 876.338765
3 12 3 833.333333 1016.349674
Do you know which sklearn model or other library I should use to get a better fit?
You missed a fundamental step: normalizing the data.
Fix
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.svm import SVR
df = pd.DataFrame({'p': [15-x for x in range(14)]
, 'x': [x for x in range(14)]})
df['y'] = 1000 * (10 / df['p'])
# Normalize the data: (x - mean(x)) / std(x)
s_p = np.std(df['p'])
m_p = np.mean(df['p'])
s_y = np.std(df['y'])
m_y = np.mean(df['y'])
df['p_'] = (df['p'] - m_p)/s_p
df['y_'] = (df['y'] - m_y)/s_y
# Fit and make prediction
nlm = SVR(kernel='rbf').fit(df[['p_']], df['y_'])
df['nml'] = nlm.predict(df[['p_']])
# Plot
plt.plot(df['p_'], df['y_'], 'r')
plt.plot(df['p_'], df['nml'], 'g')
plt.show()
# Rescale back and plot
plt.plot(df['p_']*s_p+m_p, df['y_']*s_y+m_y, 'r')
plt.plot(df['p_']*s_p+m_p, df['nml']*s_y+m_y, 'g')
plt.show()
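The normalization named in the comment, (x - mean(x)) / std(x), and the rescaling used in the last plot can be checked in isolation. A minimal sketch on the p column only, with illustrative variable names:

```python
import numpy as np

p = np.array([15 - x for x in range(14)], dtype=float)

mean_p, std_p = p.mean(), p.std()
p_norm = (p - mean_p) / std_p    # zero mean, unit standard deviation
p_back = p_norm * std_p + mean_p  # inverse of the normalization

print(p_norm.mean(), p_norm.std())
```

Multiplying by the standard deviation and adding the mean only recovers the original values if the forward step subtracted the mean and divided by the standard deviation, in that order.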
As @mujjiga pointed out, scaling is an important part of the process.
I would like to draw your attention to two other key points:
model selection, which determines your ability to solve a class of problems;
the new sklearn API, which helps you standardize solution development.
Let's start with your dataset:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
x = np.arange(14)
df = pd.DataFrame({'x': x, 'p': 15-x})
df['y'] = 1e4/df['p']
Then we import some sklearn API objects of interest:
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, RobustScaler, FunctionTransformer
First we create a scaler for the target values:
ysc = StandardScaler()
Notice that we can use different scalers, or build a custom transformation.
# Scaler robust against outliers:
ysc = RobustScaler()
# Logarithmic Transformation:
ysc = FunctionTransformer(func=np.log, inverse_func=np.exp, check_inverse=True)
We scale target using the scaler of our choice:
ysc.fit(df[['y']])
df['yn'] = ysc.transform(df[['y']])
We also build a pipeline with a feature standardizer and the selected model (we adjusted parameters to improve the fit). We fit it to your dataset:
reg = make_pipeline(StandardScaler(), SVR(kernel='rbf', C=1e3, epsilon=1e-3))
reg.fit(df[['p']], df['yn'])
At this point we can predict values and transform them back to the original scale:
df['ynhat'] = reg.predict(df[['p']])
df['yhat'] = ysc.inverse_transform(df[['ynhat']])
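As an alternative to scaling the target by hand, scikit-learn also provides TransformedTargetRegressor in sklearn.compose, which applies the target transformer on fit and inverts it on predict. This is not what the answer above uses; it is a sketch of the same idea under the same parameters, with model and yhat as illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import TransformedTargetRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

x = np.arange(14)
df = pd.DataFrame({'x': x, 'p': 15 - x})
df['y'] = 1e4 / df['p']

# Features are standardized inside the pipeline,
# the target is standardized by the wrapper.
model = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(),
                            SVR(kernel='rbf', C=1e3, epsilon=1e-3)),
    transformer=StandardScaler(),
)
model.fit(df[['p']], df['y'])

# Predictions come back on the original scale of y
yhat = model.predict(df[['p']])
print(model.score(df[['p']], df['y']))
```

The wrapper keeps the forward and inverse target transformations together, so there is no separate inverse_transform call to forget.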
We check the fit score:
reg.score(df[['p']], df['yn']) # 0.9999646718755011
We can also compute absolute and relative error for each point:
df['yaerr'] = df['yhat'] - df['y']
df['yrerr'] = df['yaerr']/df['y']
Final result is:
x p y yn ynhat yhat yaerr yrerr
0 0 15 666.666667 -0.834823 -0.833633 668.077018 1.410352 0.002116
1 1 14 714.285714 -0.794636 -0.795247 713.562403 -0.723312 -0.001013
2 2 13 769.230769 -0.748267 -0.749627 767.619013 -1.611756 -0.002095
3 3 12 833.333333 -0.694169 -0.693498 834.128425 0.795091 0.000954
4 4 11 909.090909 -0.630235 -0.629048 910.497550 1.406641 0.001547
5 5 10 1000.000000 -0.553514 -0.555029 998.204445 -1.795555 -0.001796
6 6 9 1111.111111 -0.459744 -0.460002 1110.805275 -0.305836 -0.000275
7 7 8 1250.000000 -0.342532 -0.341099 1251.697707 1.697707 0.001358
8 8 7 1428.571429 -0.191830 -0.193295 1426.835676 -1.735753 -0.001215
9 9 6 1666.666667 0.009105 0.010458 1668.269984 1.603317 0.000962
10 10 5 2000.000000 0.290414 0.291060 2000.764717 0.764717 0.000382
11 11 4 2500.000000 0.712379 0.690511 2474.088446 -25.911554 -0.010365
12 12 3 3333.333333 1.415652 1.416874 3334.780642 1.447309 0.000434
13 13 2 5000.000000 2.822199 2.821420 4999.076799 -0.923201 -0.000185
Graphically it leads to:
fig, axe = plt.subplots()
axe.plot(df['p'], df['y'], label='$y(p)$')
axe.plot(df['p'], df['yhat'], 'o', label=r'$\hat{y}(p)$')
axe.set_title(r"SVR Fit for $y(x) = \frac{k}{x-a}$")
axe.set_xlabel('$p = x-a$')
axe.set_ylabel(r'$y, \hat{y}$')
axe.legend()
axe.grid()
Linearization
In the example above we could not use the poly kernel; we had to use the rbf kernel instead. If we aim to fit a rational function with a polynomial, it is better to transform the data before fitting, using a p = x/(x-b) substitution in the first place. The problem then boils down to a linear regression. The example below shows that it works:
The scaler and the transformation can be composed into a pipeline as well. We define a pipeline that linearizes and scales the problem:
# Rational Fraction Substitution with consecutive Standardization
ysc = make_pipeline(
    FunctionTransformer(func=lambda x: x/(x+1),
                        inverse_func=lambda x: x/(1-x),
                        check_inverse=True),
    StandardScaler()
)
Then we can regress the data using classical OLS:
reg = make_pipeline(StandardScaler(), LinearRegression())
reg.fit(df[['p']], df['yn'])
Which provides the correct result:
reg.score(df[['p']], df['yn']) # 0.9999998722172933
This second solution takes advantage of a known linearization and thus removes the need to tune the model's parameters.
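The idea of linearizing before fitting can also be seen with a simpler substitution than the one in the pipeline above. Since y = k/p, the reciprocal 1/y is exactly linear in p, so an ordinary least-squares line recovers it. This is a sketch of that different substitution (it uses np.polyfit rather than the sklearn pipeline), with illustrative variable names.

```python
import numpy as np

p = np.array([15 - x for x in range(14)], dtype=float)
y = 1e4 / p

# 1/y = p / 1e4 is a straight line through the origin
slope, intercept = np.polyfit(p, 1.0 / y, deg=1)

# Invert the substitution to get predictions on the original scale
y_hat = 1.0 / (slope * p + intercept)

print(slope, intercept)
```

The fitted slope should be close to 1/k, which makes the model directly interpretable.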
I am trying to run t-SNE, but Python shows me this error:
IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
The data is provided by this link.
Here's the code:
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
#Step 1 - Download the data
dataframe_all = pd.read_csv('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv')
num_rows = dataframe_all.shape[0]
#Step 2 - Clean the data
#count the number of missing elements (NaN) in each column
counter_nan = dataframe_all.isnull().sum()
counter_without_nan = counter_nan[counter_nan==0]
#remove the columns with missing elements
dataframe_all = dataframe_all[counter_without_nan.keys()]
#remove the first 7 columns which contain no discriminative information
#(.ix was removed in pandas 1.0; .iloc is the positional equivalent)
dataframe_all = dataframe_all.iloc[:,7:]
#Step 3: Create feature vectors
x = dataframe_all.iloc[:,:-1].values
standard_scalar = StandardScaler()
x_std = standard_scalar.fit_transform(x)
# t distributed stochastic neighbour embedding (t-SNE) visualization
tsne = TSNE(n_components=2, random_state = 0)
x_test_2d = tsne.fit_transform(x_std)
#scatter plot the sample points among 5 classes
markers=('s','d','o','^','v')
color_map = {0:'red', 1:'blue', 2:'lightgreen', 3:'purple', 4:'cyan'}
plt.figure()
for idx, cl in enumerate(np.unique(x_test_2d)):
    plt.scatter(x=x_test_2d[cl, 0],y =x_test_2d[cl, 1], c=color_map[idx], marker=markers[idx], label=cl)
plt.show()
What do I have to change in order to make this work?
The error is due to the following line:
plt.scatter(x_test_2d[cl, 0], x_test_2d[cl, 1], c=color_map[idx], marker=markers[idx])
Here, cl takes non-integer values (they come from np.unique(x_test_2d)), and this raises the error. For example, the last value that cl takes is 99.46295, so x_test_2d[cl, 0] becomes x_test_2d[99.46295, 0].
Define a variable y that holds the class labels, then use:
# variable holding the classes
y = dataframe_all.classe.values
y = np.array([ord(i) for i in y])
#scatter plot the sample points among 5 classes
plt.figure()
plt.scatter(x_test_2d[:, 0], x_test_2d[:, 1], c = y)
plt.show()
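If a separate marker and legend entry per class is still wanted, as in the original loop, one option is to loop over the unique class labels and select rows with a boolean mask. The following sketch uses small made-up arrays (points and labels are stand-ins for x_test_2d and the classe column) and also reproduces the IndexError caused by a float index.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(6, 2))   # stand-in for x_test_2d
labels = np.array(list("ABABCC"))  # stand-in for the 'classe' column

# A float index raises the IndexError from the question
try:
    points[99.46295, 0]
except IndexError as exc:
    message = str(exc)
    print(message)

# Group the rows by class label using boolean masks
groups = {}
for cl in np.unique(labels):
    mask = labels == cl
    groups[cl] = points[mask]

print({cl: rows.shape for cl, rows in groups.items()})
```

Each groups[cl] could then be passed to plt.scatter with its own marker, color and label.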
FULL CODE:
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE
#Step 1 - Download the data
dataframe_all = pd.read_csv('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv')
num_rows = dataframe_all.shape[0]
#Step 2 - Clean the data
#count the number of missing elements (NaN) in each column
counter_nan = dataframe_all.isnull().sum()
counter_without_nan = counter_nan[counter_nan==0]
#remove the columns with missing elements
dataframe_all = dataframe_all[counter_without_nan.keys()]
#remove the first 7 columns which contain no discriminative information
#(.ix was removed in pandas 1.0; .iloc is the positional equivalent)
dataframe_all = dataframe_all.iloc[:,7:]
#Step 3: Create feature vectors
x = dataframe_all.iloc[:,:-1].values
standard_scalar = StandardScaler()
x_std = standard_scalar.fit_transform(x)
# t distributed stochastic neighbour embedding (t-SNE) visualization
tsne = TSNE(n_components=2, random_state = 0)
x_test_2d = tsne.fit_transform(x_std)
# variable holding the classes
y = dataframe_all.classe.values # you need this for the colors
y = np.array([ord(i) for i in y]) # convert letters to numbers
#scatter plot the sample points among 5 classes
plt.figure()
plt.scatter(x_test_2d[:, 0], x_test_2d[:, 1], c = y)
plt.show()
I have done EDA on this white wine dataset and I am trying to find 3 predictors of quality and run a linear regression on them.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
wine = "~/Desktop/datasets/winequality-white.csv"
# Load the data
df = pd.read_csv(wine,sep=";")
df.head()
# Look at the information regarding its columns.
df.info()
# non-null floats, also validated by:
null_release_mask = df['fixed acidity'].isnull()
I'm trying to do a train-test split and choose 3 predictors to predict quality:
from sklearn.model_selection import train_test_split
X = df[["alcohol", "pH","free sulfur dioxide"]]
y = df["quality"]
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.3, random_state=42)
print(len(X_train), len(X_test))
print(len(y_train), len(y_test))
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train,y_train)
import numpy as np
x_values_to_plot = np.linspace(0, df[["alcohol", "pH","free sulfur dioxide"]].max(), 15)
y_values_to_plot = (x_values_to_plot * model.coef_) + model.intercept_
fig, ax = plt.subplots(figsize=(6,6))
ax.scatter(df[["alcohol", "pH","free sulfur dioxide"]], df["quality"], label="data", alpha=0.2)
ax.plot(x_values_to_plot, y_values_to_plot, label="regression_line of white wines", c="r")
ax.legend(loc="best")
plt.show()
However, I get this error:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-68-c52d735932ab> in <module>()
      1 import numpy as np
      2
----> 3 x_values_to_plot = np.linspace(0, df[["alcohol", "pH","free sulfur dioxide"]].max(), 15)
      4 y_values_to_plot = (x_values_to_plot * model.coef_) + model.intercept_
      5
~/anaconda3/lib/python3.7/site-packages/numpy/core/function_base.py in linspace(start, stop, num, endpoint, retstep, dtype)
    122     if num > 1:
    123         step = delta / div
--> 124         if step == 0:
    125             # Special handling for denormal numbers, gh-5437
    126             y /= div
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Any help would be greatly appreciated. I am new to StackOverflow, so please be lenient about the format of the question and let me know what I can improve. Thanks
This particular error comes from this snippet:
x_values_to_plot = np.linspace(0, df[["alcohol", "pH","free sulfur dioxide"]].max(), 15)
Since
df[["alcohol", "pH","free sulfur dioxide"]].max()
returns three values (the maximum of alcohol, pH and free SO2), np.linspace receives three values rather than a single scalar endpoint. You can fix this by adding another .max(), which selects the largest of those three maxima, assuming this is what you are trying to do.
You have a few other issues in the section below your regression model. What exactly do you want to present at the end? You could always try seaborn, which is good for this type of visualisation.
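The difference between one and two .max() calls can be seen on a small frame. This sketch uses three made-up rows with the same column names as the question; the values are not taken from the wine dataset.

```python
import numpy as np
import pandas as pd

# Made-up values, only the column names match the question
df = pd.DataFrame({
    "alcohol": [8.8, 9.5, 10.1],
    "pH": [3.0, 3.3, 3.26],
    "free sulfur dioxide": [45.0, 14.0, 30.0],
})
cols = ["alcohol", "pH", "free sulfur dioxide"]

per_column_max = df[cols].max()     # Series with one maximum per column
overall_max = df[cols].max().max()  # single scalar

x_values_to_plot = np.linspace(0, overall_max, 15)
print(per_column_max.shape, overall_max, x_values_to_plot.shape)
```

Note that a single overall maximum puts all three predictors on one shared x range, which may not be what a per-predictor plot needs.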