My line in matplotlib has the correct shape, but it is made up of zig-zagging line segments.
I've tried restarting and graphing the same equation on Desmos. The equation on Desmos looks exactly how I want it to, so I think this is a matplotlib issue.
#imports
import numpy as np
import pandas as pd
import seaborn as sns; sns.set() # just makes your plots look prettier; run 'pip install seaborn' if needed
import matplotlib.pyplot as plt
from IPython.core.pylabtools import figsize
figsize(15, 7)
from sklearn.model_selection import train_test_split
noise = np.random.randn(100)
x = np.linspace(-2,2, 100)
y = x + noise + np.random.randn()*2 + x**2
plt.scatter(x, y); plt.show()
#pre processing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
#initializing m and b variables
current_z_val = 0.1
current_m_val = 0.1
current_b_val = 0.1
#setting # of iterations
iterations = 5
#calculating length of examples for functions used below
n = len(x_train)
#learning rate
learning_rate = 0.01
#plot the data and estimates
plt.scatter(x_train,y_train)
plt.title("Example data and hypothesis lines")
plt.xlabel('X Axis')
plt.ylabel('Y Axis')
cost_history = []
#main gradient descent loop
for i in range(iterations):
    #creating the hypothesis using the y = z*x^2 + m*x + b form
    y_hypothesis = (current_z_val * (x_train**2)) + (current_m_val * x_train) + current_b_val
    #calculating the derivatives from the image embedded above in code
    z_deriv = -(2/n)*sum(y_train-y_hypothesis)
    m_deriv = -(2/n)*sum(x_train*(y_train-y_hypothesis))
    b_deriv = -(2/n)*sum(y_train-y_hypothesis)
    #updating z, m and b values
    current_z_val = current_z_val - (learning_rate * z_deriv)
    current_m_val = current_m_val - (learning_rate * m_deriv)
    current_b_val = current_b_val - (learning_rate * b_deriv)
    #calculate the cost (error) of the model
    cost = (1/n)*sum(y_train-y_hypothesis)**2
    cost_history.append(cost)
    #print the m and b values
    #print("iteration {}, cost {}, m {}, b {}".format(i,cost,current_m_val,current_b_val))
    plt.plot(x_train,y_hypothesis)
plt.show()
#plot the final graph
plt.plot(range(1,len(cost_history)+1),cost_history)
plt.title("Cost at each iteration")
plt.xlabel('Iterations')
plt.ylabel('MSE')
plt.show()
This is what a graph looks like on my plot. And this is what it should look like.
matplotlib plots the points following their order in the list, not their "natural" order given by their magnitude.
I think you should sort x_train before computing y_hypothesis in order to get the function you expect.
Note that this happens in both plt.scatter() and plt.plot(), but you only see it in the latter, because plt.plot() connects the dots and so makes the sequence visible.
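For illustration, a minimal sketch of that fix, reusing the names from the question (just the idea, not a full rewrite):
# sort the shuffled training data once, right after train_test_split
order = np.argsort(x_train)
x_train, y_train = x_train[order], y_train[order]
# every later plt.plot(x_train, y_hypothesis) now draws from left to right instead of zig-zagging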
The function train_test_split randomly selects the train and test samples, so your x ends up shuffled. Matplotlib will not be able to draw a sensible line if your x is not in order.
Use shuffle=False in the following line; that should make the plot right.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, shuffle=False)
Does my code have a bug or something else?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
file = 'https://aegis4048.github.io/downloads/notebooks/sample_data/unconv_MV_v5.csv'
myDF = pd.read_csv(file)
# Split the data into features and target
feature1 = "Brittle"
feature2 = "Por"
X = myDF[[feature1, feature2]] #.iloc[:, :-1].values # A NumPy array!
print("X.info():", X.info())
y = myDF["Prod"] #.iloc[:, -1].values
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Create a linear regression object
reg = LinearRegression()
# Fit the model to the training data
reg.fit(X_train, y_train)
# Predict the target variable using the test data
y_pred = reg.predict(X_test)
# Evaluate the model using mean squared error (MSE)
mse = np.mean((y_test - y_pred)**2)
print("Mean Squared Error: ", mse)
print("R2 Score:", reg.score(X_test, y_test))
#define figure size in (width, height) for all plots
plt.rcParams['figure.figsize'] = [10, 7]
# Create a mesh of values for the features
print(X_train.shape) # NumPy array
x1_min, x1_max = X_train[feature1].min(), X_train[feature1].max()
x2_min, x2_max = X_train[feature2].min(), X_train[feature2].max()
x1, x2 = np.meshgrid(np.linspace(x1_min, x1_max, 100), np.linspace(x2_min, x2_max, 100))
X_mesh = np.c_[x1.ravel(), x2.ravel()]
# Compute the predictions for the mesh of values
y_pred_mesh = reg.predict(X_mesh).reshape(x1.shape)
# Plot the predictions as a surface. Request 10 contour lines.
contours = plt.contourf(x1, x2, y_pred_mesh, 10, cmap='coolwarm', alpha=0.8) # https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.contourf.html
# Scatter plot of the training data.
# The colors of the points don't mean much except to stand out from the background
plt.scatter(X_train[feature1], X_train[feature2], c=y_train, cmap='coolwarm', s=20)
# Label the contour lines
plt.clabel(contours, inline=1, fontsize=12, colors = "black")
# Label the plot
plt.xlabel(feature1)
plt.ylabel(feature2)
plt.title('Multivariate Linear Regression Contour Plot')
# Show the plot
plt.show()
The output:
I am trying to produce scatter plots with regression curves using the following code. I am using different algorithms such as linear regression, SVM, and Gaussian process regression. I have tried the different options for plotting the data mentioned below.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.gaussian_process import GaussianProcessRegressor
df = pd.read_excel('coded.xlsx')
dfnew=df[['FL','FW','TL','LL','KH']]
Y = df['KH']
X = df[['FL']]
X=X.values.reshape(len(X),1)
Y=Y.values.reshape(len(Y),1)
# Split the data into training/testing sets
X_train = X[:-270]
X_test = X[-270:]
# Split the targets into training/testing sets
Y_train = Y[:-270]
Y_test = Y[-270:]
#regressor = SVR(kernel = 'rbf')
#regressor.fit(X_train, np.ravel(Y_train))
#training the algorithm
regressor = GaussianProcessRegressor(random_state=42)
regressor.fit(X_train, Y_train)
y_pred = regressor.predict(X_test)
mse = np.sum((y_pred - Y_test)**2)
# root mean squared error
# m is the number of training examples
rmse = np.sqrt(mse/270)
print(rmse)
#X_grid = np.arange(min(X), max(X), 0.01) #this step required because data is feature scaled.
#X_grid = np.arange(0, 15, 0.01) #this step required because data is feature scaled.
#X_grid = X_grid.reshape((len(X_grid), 1))
#plt.scatter(X, Y, color = 'red')
print('size of Y_train = {0}'.format(Y_train.size))
print('size of y_pred = {0}'.format(y_pred.size))
#plt.scatter(Y_train, y_pred, color = 'red')
#plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
#plt.title('GPR')
#plt.xlabel('Measured')
#plt.ylabel('Predicted')
#plt.show()
fig, ax = plt.subplots(1, figsize=(12, 6))
plt.plot(X[:, 0], Y_train, marker='o', color='black', linewidth=0)
plt.plot(X[:, 0], y_pred, marker='x', color='steelblue')
plt.suptitle("$GaussianProcessRegressor(kernel=RBF)$ [default]", fontsize=20)
plt.axis('off')
pass
But I am getting an error like:
ValueError: x and y must have same first dimension, but have shapes (540,) and (270, 1)
What is the possible solution?
This code splits X and Y into training/testing sets, but then tries to plot a column from all of X with Y_train and y_pred, which have only half as many values as X. Try creating plots with X_train and X_test instead.
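A minimal sketch of that change, reusing the variable names from the question (shapes assumed to be as in the code above):
# training points as black circles, test predictions as blue crosses
fig, ax = plt.subplots(1, figsize=(12, 6))
plt.plot(X_train[:, 0], Y_train, marker='o', color='black', linewidth=0)
plt.plot(X_test[:, 0], y_pred, marker='x', color='steelblue', linewidth=0)
plt.suptitle("GaussianProcessRegressor(kernel=RBF) [default]", fontsize=20)
plt.show()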
I have a polynomial regression script that works correctly to predict values with X and Y axes. In my example I use CPU consumption; below we see an example of the data set:
Complete data set
Where time represents the collection time, for example:
1 = 1 minute
2 = 2 minutes
And so on...
And consume is the CPU usage value for that minute. In short, this data set demonstrates the behavior of a host over a period of 30 minutes, each value corresponding to one minute in ascending order (1 min, 2 min, 3 min ...).
The result for this is:
With this algorithm:
# -*- coding: utf-8 -*-
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('data.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X)
pol_reg = LinearRegression()
pol_reg.fit(X_poly, y)
# Visualizing the Polynomial Regression results
def viz_polymonial():
    plt.scatter(X, y, color='red')
    plt.plot(X, pol_reg.predict(poly_reg.fit_transform(X)), color='blue')
    plt.title('Polynomial Regression for CPU')
    plt.xlabel('Time range')
    plt.ylabel('Consume')
    plt.show()
    return
viz_polymonial()
# 20 = time
print(pol_reg.predict(poly_reg.fit_transform([[20]])))
What's the problem?
If we duplicate this data set so that the 30-minute range appears twice, the algorithm does not understand the data set and its result is not as good. Example of the data set:
--> Up to time = 30
--> Up to time = 30
Complete data set
Note: In this case there are 60 values, where every 30 values represent a range of 30 minutes; it is as if they were different collection days.
The result it shows is this:
Objective: I would like the blue line that represents the polynomial regression to be similar to the one in the first result image. The one we see above shows a loop where the points are connected, as if the algorithm had failed.
Research source
The problem is that in the second case, you plot using X = 1, 2, ... 30, 1, 2, ... 30. The plot function connects successive points. If you just plotted a scatter using pyplot, you would see your nice regression curve. Or you could argsort. Here is the code with the scatter in green, the argsort line in black.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression
# Importing the dataset
# dataset = pd.read_csv('data.csv')
dataset = pd.read_csv('data.csv')
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Fitting Polynomial Regression to the dataset
from sklearn.preprocessing import PolynomialFeatures
poly_reg = PolynomialFeatures(degree=4)
X_poly = poly_reg.fit_transform(X)
pol_reg = LinearRegression()
pol_reg.fit(X_poly, y)
# Visualizing the Polynomial Regression results
def viz_polymonial():
    plt.scatter(X, y, color='red')
    indices = np.argsort(X[:, 0])
    plt.scatter(X, pol_reg.predict(poly_reg.fit_transform(X)), color='green')
    plt.plot(X[indices], pol_reg.predict(poly_reg.fit_transform(X))[indices], color='black')
    plt.title('Polynomial Regression for CPU')
    plt.xlabel('Time range')
    plt.ylabel('Consume')
    plt.show()
    return
viz_polymonial()
# 20 = time
print(pol_reg.predict(poly_reg.fit_transform([[20]])))
Here is the output image for the larger dataset.
The data file I would like to process has 71 records built from two columns: one for the x value and a second one for the y value. The main task is to select a training part and a testing part and to plot the chosen functions (in my example I've taken a linear one and an exponential (^4) one).
However I've stumbled upon an error I can't solve.
Full description of the error:
File "zad1.py", line 25, in <module>
    v = np.linalg.pinv(c) @ y
ValueError: matmul: Input operand 1 has a mismatch in its core dimension 0,
with gufunc signature (n?,k),(k,m?)->(n?,m?) (size 71 is different from 53)
code
from sklearn.model_selection import train_test_split
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt
a = np.loadtxt('dane10.txt')
x = a[:,[1]]
y = a[:,[0]]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
c = np.hstack([X_train, np.ones(X_train.shape)])
v = np.linalg.pinv(c) @ y
plt.plot(X_train, y_train, 'ro')
plt.plot(X_test, y_test, 'go')
plt.plot(X_train,v[0]*X_train + v[1])
c = np.hstack([
X_train * X_train * X_train * X_train,
X_train * X_train * X_train,
X_train * X_train,
X_train,
np.ones(X_train.shape)])
v = np.linalg.pinv(c) @ y
plt.plot(v[0]*X_train^4 + v[1]*X_train^3 + v[2]*X_train^2 + v[3]*X_train +v[4])
plt.show()
Would appreciate any help :).
I've redone it a little and both functions are plotted now, but the exponential one is kinda weird... I mean, something is not right here, because it's not fitting the points of the diagram but is drawn far away from them.
from sklearn.model_selection import train_test_split
from sklearn import metrics
import numpy as np
import matplotlib.pyplot as plt
a = np.loadtxt('dane10.txt')
x = a[:,[1]]
y = a[:,[0]]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
c = np.hstack([x, np.ones(x.shape)])
v = np.linalg.pinv(c) @ y
plt.plot(X_train, y_train, 'ro')
plt.plot(X_test, y_test, 'go')
plt.plot(X_train,v[0]*X_train + v[1])
c = np.hstack([
x * x * x * x,
x * x * x,
x * x,
x,
np.ones(x.shape)])
v = np.linalg.pinv(c) @ y
plt.plot(v[0]*X_train*X_train*X_train*X_train + v[1]*X_train*X_train*X_train +
         v[2]*X_train*X_train + v[3]*X_train + v[4])
plt.show()
The problem apparently happens when you multiply X_train * X_train.
Since it is not a square matrix, it cannot be multiplied by itself. Do you just need to raise each number in X_train to the 2nd-4th power? In that case, use numpy.multiply.
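For illustration, one way to build the degree-4 design matrix with element-wise powers, reusing X_train and y_train from the code above (a sketch only; on a NumPy ndarray both * and ** act element-wise):
# each column is X_train raised element-wise to a power, plus a column of ones
c = np.hstack([X_train**4, X_train**3, X_train**2, X_train, np.ones(X_train.shape)])
v = np.linalg.pinv(c) @ y_train  # least-squares coefficients for the degree-4 fit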
I have the following variables:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
def part1_scatter():
    %matplotlib notebook
    plt.figure()
    plt.scatter(X_train, y_train, label='training data')
    plt.scatter(X_test, y_test, label='test data')
    plt.legend(loc=4);
And the following question:
Write a function that fits a polynomial LinearRegression model on the training data X_train for degrees 1, 3, 6, and 9. (Use PolynomialFeatures in sklearn.preprocessing to create the polynomial features and then fit a linear regression model) For each model, find 100 predicted values over the interval x = 0 to 10 (e.g. np.linspace(0,10,100)) and store this in a numpy array. The first row of this array should correspond to the output from the model trained on degree 1, the second row degree 3, the third row degree 6, and the fourth row degree 9.
This is my code, but it doesn't work:
def answer_one():
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
np.random.seed(0)
n = 15
x = np.linspace(0,10,n) + np.random.randn(n)/5
y = np.sin(x)+x/6 + np.random.randn(n)/10
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
results = []
pred_data = np.linspace(0,10,100)
degree = [1,3,6,9]
y_train1 = y_train.reshape(-1,1)
for i in degree:
poly = PolynomialFeatures(degree=i)
pred_poly1 = poly.fit_transform(pred_data[:,np.newaxis])
X_F1_poly = poly.fit_transform(X_train[:,np.newaxis])
linreg = LinearRegression().fit(X_F1_poly, y_train1)
pred = linreg.predict(pred_poly1)
results.append(pred)
dataArray = np.array(results).reshape(4, 100)
return dataArray
I receive this error:
line 58
    for i in degree:
    ^
IndentationError: unexpected indent
Could you tell me where the problem is?
The return statement should be executed after the for loop is done, so it should be indented at the same level as the for, not further in.
At the start of your line n = 15 you stopped indenting, so that part isn't recognized as part of the function. This can be solved by putting 4 spaces at the start of all lines from n = 15 onwards.
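Putting the two answers together, the indentation might look like this (a sketch that only restores the structure, keeping the original names):
def answer_one():
    from sklearn.linear_model import LinearRegression
    from sklearn.preprocessing import PolynomialFeatures
    np.random.seed(0)
    n = 15
    x = np.linspace(0, 10, n) + np.random.randn(n)/5
    y = np.sin(x) + x/6 + np.random.randn(n)/10
    X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
    results = []
    pred_data = np.linspace(0, 10, 100)
    degree = [1, 3, 6, 9]
    y_train1 = y_train.reshape(-1, 1)
    for i in degree:
        # everything that runs once per degree stays one level inside the for
        poly = PolynomialFeatures(degree=i)
        pred_poly1 = poly.fit_transform(pred_data[:, np.newaxis])
        X_F1_poly = poly.fit_transform(X_train[:, np.newaxis])
        linreg = LinearRegression().fit(X_F1_poly, y_train1)
        results.append(linreg.predict(pred_poly1))
    # back at function level once the loop is done
    dataArray = np.array(results).reshape(4, 100)
    return dataArray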