I am trying to carry out linear regression subject to some constraints, to get a certain prediction.
I want the model to predict the first half of the data linearly, and to keep the second half of the prediction within a very narrow range around the last value of the first half (using constraints), similar to the green line in the figure.
The full code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
pd.options.mode.chained_assignment = None # default='warn'
data = [5.269, 5.346, 5.375, 5.482, 5.519, 5.57, 5.593999999999999, 5.627000000000001, 5.724, 5.818, 5.792999999999999, 5.817, 5.8389999999999995, 5.882000000000001, 5.92, 6.025, 6.064, 6.111000000000001, 6.1160000000000005, 6.138, 6.247000000000001, 6.279, 6.332000000000001, 6.3389999999999995, 6.3420000000000005, 6.412999999999999, 6.442, 6.519, 6.596, 6.603, 6.627999999999999, 6.76, 6.837000000000001, 6.781000000000001, 6.8260000000000005, 6.849, 6.875, 6.982, 7.018, 7.042000000000001, 7.068, 7.091, 7.204, 7.228, 7.261, 7.3420000000000005, 7.414, 7.44, 7.516, 7.542000000000001, 7.627000000000001, 7.667000000000001, 7.821000000000001, 7.792999999999999, 7.756, 7.871, 8.006, 8.078, 7.916, 7.974, 8.074, 8.119, 8.228, 7.976, 8.045, 8.312999999999999, 8.335, 8.388, 8.437999999999999, 8.456, 8.227, 8.266, 8.277999999999999, 8.289, 8.299, 8.318, 8.332, 8.34, 8.349, 8.36, 8.363999999999999, 8.368, 8.282, 8.283999999999999]
time = range(1, 85)
df = pd.DataFrame(list(zip(time, data)), columns=['time', 'data'])
x = int(0.7 * len(df))
train = df[:x]
valid = df[x:]
models = []
tr_x_ax = []
va_x_ax = []
pr_x_ax = []
models.append(('LR', LinearRegression()))
for name, model in models:
    x_train = df.iloc[:, 0][:x].values
    y_train = df.iloc[:, 1][:x].values
    x_valid = df.iloc[:, 0][x:].values
    y_valid = df.iloc[:, 1][x:].values
    x_train = x_train.reshape(-1, 1)
    y_train = y_train.reshape(-1, 1)
    x_valid = x_valid.reshape(-1, 1)
    y_valid = y_valid.reshape(-1, 1)
    model.fit(x_train, y_train.ravel())
    preds = model.predict(x_valid)
    tr_x_ax.extend(train['data'])
    va_x_ax.extend(valid['data'])
    pr_x_ax.extend(preds)
    valid['Predictions'] = preds
    valid.index = df[x:].index
    train.index = df[:x].index
plt.figure(figsize=(5, 5))
plt.plot(valid['Predictions'], label='Predictions before', color='orange')
y = range(0, 58)
y1 = range(58, 84)
# clamp every prediction after index 13 (time = 71) to the value at index 13
for index, item in enumerate(pr_x_ax):
    if index > 13:
        pr_x_ax[index] = pr_x_ax[13]
pr_x_ax = [float(i) for i in pr_x_ax]
va_x_ax = [float(i) for i in va_x_ax]
tr_x_ax = [float(i) for i in tr_x_ax]
plt.plot(y, tr_x_ax, label='train', color='red', linewidth=2)
plt.plot(y1, va_x_ax, label='validation1', color='blue', linewidth=2)
plt.plot(y1, pr_x_ax, label='Predictions after', color='green', linewidth=2)
plt.xlabel("time")
plt.ylabel("data")
plt.xticks(rotation=45)
plt.legend()
plt.show()
If you look at the figure:
The line labeled "Predictions before" is what the model predicted without any constraints (I don't need this result).
The line labeled "Predictions after" is the prediction with the constraint, but it is applied only after the model has already predicted, and all values are set equal to the last value at index 71 (item 8.56).
I did this with the loop for index, item in enumerate(pr_x_ax): above, so the curve is a straight line from time 71 to 85 s, as you can see, in order to show you how I need the model to work.
Could I build a model that gives the same result directly, instead of using the for loop?
I would appreciate your suggestions.
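(As an aside, a minimal sketch of the clamping without an explicit Python loop, assuming pr_x_ax holds the validation predictions as in the code above; this only post-processes the output, it does not change what the model learns:)
import numpy as np

preds = np.asarray(pr_x_ax)
preds[14:] = preds[13]   # hold every value after index 13 (time = 71) constant
pr_x_ax = preds.tolist()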
I expect that by drawing the green line in your question you really want the trained model to predict a linear, horizontal turn to the right. But the currently trained model draws just the straight orange line.
It is true for any trained model of any algorithm and type that, in order to learn some unusual change in behavior, the model needs to see at least some samples of that change, or at least some hidden pattern in the observed data should point to it.
In other words, for your model to learn the right turn in the green line, the training data set should contain points with that right turn. But you take just the first (leftmost) 70% of the data for training, by train = df[:int(0.7 * len(df))], and that training data has no such right turns; it looks close to one straight line.
So you need to re-sample your data into training and validation sets in a different way: take a random 70% of the samples from the whole range of X and let the rest go to validation, so that samples from the right turn are also included in your training data.
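A minimal sketch of that re-sampling, assuming the df from the question (the full code below does the same thing with an 80/20 split):
import numpy as np

np.random.seed(0)
# Each row independently has a 70% chance of landing in the training set,
# so training samples come from the whole time range, including the turn.
choose_train = np.random.uniform(size=len(df)) < 0.7
train = df[choose_train]
valid = df[~choose_train]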
The second thing is that a LinearRegression model always models its predictions with one single straight line, and that line can't have right turns. In order to have right turns you need a more complex model.
One way for a model to have a right turn is to be piecewise linear, i.e. made of several joined straight lines. I didn't find any ready-made piecewise linear models inside sklearn, only in third-party pip packages, so I decided to implement my own simple class PieceWiseLinearRegression, which uses np.piecewise() and scipy.optimize.curve_fit() to model a piecewise linear function.
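If np.piecewise() is unfamiliar, here is a minimal standalone illustration before the full class: it evaluates a different function on each region of x, selected by a list of boolean conditions.
import numpy as np

x = np.linspace(0, 6, 7)
y = np.piecewise(
    x,
    [x < 3, x >= 3],                  # one boolean mask per segment
    [lambda v: v, lambda v: 3.0],     # rising line, then a flat line
)
print(y)  # [0. 1. 2. 3. 3. 3. 3.]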
The next picture shows the results of applying the two things mentioned above (the code follows afterwards): re-sampling the dataset in a different way and modeling a piecewise linear function. Your current linear model LR still makes its prediction with just one straight blue line, while my piecewise linear PWLR2 (the orange line) consists of two segments and correctly predicts the right turn:
To see just the PWLR2 graph clearly, I made the next picture too:
On creation, my class PieceWiseLinearRegression accepts just one argument, n: the number of linear segments to use for prediction. n = 2 was used for the picture above.
import numpy as np, pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
np.random.seed(0)
class PieceWiseLinearRegression:
    @classmethod
    def nargs_func(cls, f, n):
        # Build a lambda with exactly n named positional arguments that forwards
        # to f, because curve_fit inspects the signature to count parameters.
        return eval('lambda ' + ', '.join([f'a{i}' for i in range(n)]) + ': f(' + ', '.join([f'a{i}' for i in range(n)]) + ')', locals())

    @classmethod
    def piecewise_linear(cls, n):
        # Boolean masks selecting which of the n segments each x belongs to.
        condlist = lambda xs, xa: [(lambda x: (
            (xs[i] <= x if i > 0 else np.full_like(x, True, dtype = np.bool_)) &
            (x < xs[i + 1] if i < n - 1 else np.full_like(x, True, dtype = np.bool_))
        ))(xa) for i in range(n)]
        # Linear interpolation between the two end points of each segment,
        # guarding against division by a (nearly) zero segment width.
        funclist = lambda xs, ys: [(lambda i: (
            lambda x: (
                (x - xs[i]) * (ys[i + 1] - ys[i]) / (
                    (xs[i + 1] - xs[i]) if abs(xs[i + 1] - xs[i]) > 10 ** -7 else 10 ** -7 * (-1, 1)[xs[i + 1] - xs[i] >= 0]
                ) + ys[i]
            )
        ))(j) for j in range(n)]
        def f(x, *pargs):
            # pargs interleaves the break-point coordinates: x0, y0, x1, y1, ...
            assert len(pargs) == (n + 1) * 2, (n, pargs)
            xs, ys = pargs[0::2], pargs[1::2]
            xa = x.ravel().astype(np.float64)
            ya = np.piecewise(x = xa, condlist = condlist(xs, xa), funclist = funclist(xs, ys)).ravel()
            return ya
        return cls.nargs_func(f, 1 + (n + 1) * 2)

    def __init__(self, n):
        self.n = n
        self.f = self.piecewise_linear(self.n)

    def fit(self, x, y):
        from scipy import optimize
        # Initial guess: break points evenly spaced over the x range, y values 1.
        self.p, self.e = optimize.curve_fit(self.f, x, y, p0 = [j for i in range(self.n + 1) for j in (np.amin(x) + i * (np.amax(x) - np.amin(x)) / self.n, 1)])

    def predict(self, x):
        return self.f(x, *self.p)
data = [5.269, 5.346, 5.375, 5.482, 5.519, 5.57, 5.593999999999999, 5.627000000000001, 5.724, 5.818, 5.792999999999999, 5.817, 5.8389999999999995, 5.882000000000001, 5.92, 6.025, 6.064, 6.111000000000001, 6.1160000000000005, 6.138, 6.247000000000001, 6.279, 6.332000000000001, 6.3389999999999995, 6.3420000000000005, 6.412999999999999, 6.442, 6.519, 6.596, 6.603, 6.627999999999999, 6.76, 6.837000000000001, 6.781000000000001, 6.8260000000000005, 6.849, 6.875, 6.982, 7.018, 7.042000000000001, 7.068, 7.091, 7.204, 7.228, 7.261, 7.3420000000000005, 7.414, 7.44, 7.516, 7.542000000000001, 7.627000000000001, 7.667000000000001, 7.821000000000001, 7.792999999999999, 7.756, 7.871, 8.006, 8.078, 7.916, 7.974, 8.074, 8.119, 8.228, 7.976, 8.045, 8.312999999999999, 8.335, 8.388, 8.437999999999999, 8.456, 8.227, 8.266, 8.277999999999999, 8.289, 8.299, 8.318, 8.332, 8.34, 8.349, 8.36, 8.363999999999999, 8.368, 8.282, 8.283999999999999]
time = list(range(1, 85))
df = pd.DataFrame(list(zip(time, data)), columns = ['time', 'data'])
choose_train = np.random.uniform(size = (len(df),)) < 0.8
choose_valid = ~choose_train
x_all = df.iloc[:, 0].values
y_all = df.iloc[:, 1].values
x_train = df.iloc[:, 0][choose_train].values
y_train = df.iloc[:, 1][choose_train].values
x_valid = df.iloc[:, 0][choose_valid].values
y_valid = df.iloc[:, 1][choose_valid].values
x_all_lin = np.linspace(np.amin(x_all), np.amax(x_all), 500)
models = []
models.append(('LR', LinearRegression()))
models.append(('PWLR2', PieceWiseLinearRegression(2)))
for name, model in models:
    model.fit(x_train[:, None], y_train)
    x_all_lin_pred = model.predict(x_all_lin[:, None])
    plt.plot(x_all_lin, x_all_lin_pred, label = f'pred {name}')
plt.plot(x_train, y_train, label='train')
plt.plot(x_valid, y_valid, label='valid')
plt.xlabel('time')
plt.ylabel('data')
plt.legend()
plt.show()
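To compare the two models numerically as well as visually, a validation-error check could be added inside the model loop (a sketch using sklearn.metrics.mean_squared_error):
from sklearn.metrics import mean_squared_error

# After model.fit(...) in the loop:
y_valid_pred = model.predict(x_valid[:, None])
print(name, 'validation MSE:', mean_squared_error(y_valid, y_valid_pred))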
I have some issues with linear regression: I used a simple example and I still get an error, and I don't know what I'm doing wrong.
Here's the code:
x = [1,1,2,3,1,1,2,0,4,1]
x = np.array(x)
x = np.reshape(1,-1)
y = [1.24,0.88,0.88,1.31,1.36,0.79,0.79,0.79,1.36,1.36]
y = np.array(y)
y = np.reshape(1,-1)
lin_reg = LinearRegression()
lin_reg.fit(x,y)
"ValueError: Expected 2D array, got 1D array instead:
array=[1].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample."
The error message says what you should do in this case.
Just use .reshape(-1, 1) instead of .reshape(1, -1), called on the array itself.
Do it only for x and the problem is solved:
x = [1,1,2,3,1,1,2,0,4,1]
x = np.array(x).reshape(-1, 1) # Edited line
y = [1.24,0.88,0.88,1.31,1.36,0.79,0.79,0.79,1.36,1.36]
lin_reg = LinearRegression()
lin_reg.fit(x,y)
You called the reshape function incorrectly.
If you want to reshape the x or y arrays, you should call the method on the array itself:
x = x.reshape(1, -1)
or, combined with the line before:
x = np.array(x).reshape(1, -1)
As written, x = np.reshape(1, -1) reshapes the scalar 1, so x is overwritten with array([1]) and your data is lost (which is why the error message shows array=[1]).
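A quick check of the three variants makes the difference visible:
import numpy as np

x = np.array([1, 1, 2, 3, 1, 1, 2, 0, 4, 1])
print(x.reshape(-1, 1).shape)  # (10, 1): 10 samples, 1 feature -- what fit() expects
print(x.reshape(1, -1).shape)  # (1, 10): 1 sample with 10 features
print(np.reshape(1, -1))       # [1] -- reshapes the scalar 1, not x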
I have an array of values ranging from 0 to 1 that represents the output truth values for a neural network I'm building. However, the distribution is very wide and uneven, so I was curious whether there is a Python package that could remove samples so that the distribution is more even across the array.
Here's the distribution plot from seaborn's seaborn.distplot().
What I'd like to do is essentially specify how many 'sections' to break the array into, and remove values from the largest sections so that the distribution becomes more even.
The plot of the output of this function would probably look something like this:
Does any kind of built-in function for this exist in numpy or scipy?
If this can help anyone in the future, here's what I came up with:
import random
import numpy as np

def reject_outliers(x_t, y_t, m):
    # Drop every (x, y) pair whose y lies more than m standard deviations from the mean.
    mean = np.mean(y_t)
    std = np.std(y_t)
    x_t, y_t = zip(*[[x, y] for x, y in zip(x_t, y_t) if abs(y - mean) < (m * std)])
    return list(x_t), np.array(y_t)

def even_out_distribution(x_t, y_t, n_sections, reduction=0.5, reduce_min=.5, m=2):
    x_t, y_t = reject_outliers(x_t, y_t, m)
    # Split the y range into n_sections equal-width bins.
    linspace = np.linspace(np.min(y_t), np.max(y_t), n_sections + 1)
    sections = [[] for i in range(n_sections)]
    for x, y in zip(x_t, y_t):
        where = max(np.searchsorted(linspace, y) - 1, 0)
        sections[where].append([x, y])
    sections = [sec for sec in sections if sec != []]
    min_section = min([len(i) for i in sections])  # np.mean([len(i) for i in sections]) * reduce_min # todo: in replace of min([len(i) for i in sections])
    print([len(i) for i in sections])
    new_sections = []
    for section in sections:
        this_section = list(section)
        if len(section) > min_section:
            # Randomly remove a fraction `reduction` of the samples above the minimum.
            to_remove = (len(section) - min_section) * reduction
            for i in range(int(to_remove)):
                this_section.pop(random.randrange(len(this_section)))
        new_sections.append(this_section)
    print([len(i) for i in new_sections])
    output = [inner for outer in new_sections for inner in outer]
    x_t, y_t = zip(*output)
    return list(x_t), np.array(y_t)
I made it so that when it deletes a sample from y_train, it also deletes the corresponding sample from the input list, x_train.
Usage: x, y = even_out_distribution(x_train, y_train, n_sections=10, m=2, reduction=.8)
My distribution plot looks a lot nicer now!
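For reference, the binning step in even_out_distribution can also be written with np.digitize; a compact sketch of the same idea (a hypothetical helper, not a drop-in replacement for the function above):
import numpy as np

def bin_indices(y_t, n_sections):
    # Assign each y to one of n_sections equal-width bins over [min, max].
    edges = np.linspace(np.min(y_t), np.max(y_t), n_sections + 1)
    # np.digitize returns 1-based bin numbers; shift and clip the top edge
    # so the maximum value falls into the last bin.
    return np.clip(np.digitize(y_t, edges) - 1, 0, n_sections - 1)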
I want to delete from X_test and y_test every row where MFD is greater than one. The problem is that I always get the randomly mixed indices from train_test_split. If I try to drop the rows I get the following error message:
IndexError: index 3779 is out of bounds for axis 1 with size 3488
I can't use the old indices to drop the rows, but how can I get the new ones where MFD > 1?
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=test_size,
                                                    random_state=random_state,
                                                    stratify=y)
mfd_drop_rows = []
i_nr = 0
for i in X_test.MFD:
    if (i > 1):
        mfd_drop_rows.append(X_test.index[i_nr])
    i_nr += 1
X_test_new = X_test.drop(X_test.index[mfd_drop_rows])
y_test_new = y_test.drop(y_test.index[mfd_drop_rows])
Thanks for your help ( =
Not sure what MFD is, but assuming that X_test.MFD gives you an array of numbers, you could use a mask to drop rows. A simple example of how to use a mask (note that the arrays must be numpy arrays, and the mask is built from mfd, the values you filter on):
import numpy as np

x = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
mfd = np.array([0.6, 1.3])
mask = mfd > 1
x_new = x[mask, :]
This would give:
mask = [False, True]
x_new = [[6,7,8,9,10]]
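Applied to the DataFrames from the question (assuming MFD is a column of X_test and that y_test shares its index), the same idea replaces the whole loop:
keep = X_test['MFD'] <= 1   # boolean mask of rows to keep
X_test_new = X_test[keep]
y_test_new = y_test[keep]   # the shared index aligns the mask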
I've solved it, sorry: I just use my i_nr iteration counter and get the new indices.
Thanks to everyone who read this.
mfd_drop_rows = []
i_nr = 0
for i in X_test.MFD:
    if (i > 1):
        mfd_drop_rows.append(i_nr)   # store the positional index, not the label
    i_nr += 1
X_test_new = X_test.drop(X_test.index[mfd_drop_rows])
y_test_new = y_test.drop(y_test.index[mfd_drop_rows])