How can I reshape my array to fit (4,100) - python

This is my code, but when I run it I am not getting the correct shape. I need it to return a NumPy array of shape (4, 100).
To give an idea of what I'm doing: I fit a polynomial LinearRegression model on the training data for each of the specified degrees, then generate predictions for each polynomial's values by transposing the 100-row, single-column output into a single-row, 100-column array.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(0)
C = 15
n = 60
x = np.linspace(0, 20, n) # x is drawn from a fixed range
y = x ** 3 / 20 - x ** 2 - x + C * np.random.randn(n)
x = x.reshape(-1, 1) # convert x and y from simple array to a 1-column matrix for input to sklearn regression
y = y.reshape(-1, 1)
# Create the training and testing sets and their targets
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
def model():
    degs = (1, 3, 7, 11)
    # Reshape your data either using array.reshape(-1, 1) if your data has a single feature
    # or array.reshape(1, -1) if it contains a single sample.
    def poly_y(i):
        poly = PolynomialFeatures(degree=i)
        x_poly = poly.fit_transform(X_train.reshape(-1, 1))
        linreg = LinearRegression().fit(x_poly, y_train)
        # x_orig = np.linspace(0, 20, 100)
        y_pred = linreg.predict(poly.fit_transform(np.linspace(0, 20, 100).reshape(-1, 1)))
        y_pred = y_pred.T
        return y_pred.reshape(-1, 1)
    ans = poly_y(1)
    for i in degs:
        temp = poly_y(i)
        ans = np.vstack([ans, temp])
    return ans

model()
Image of output:

Combining the comments to your question, with a brief explanation:
You're currently doing
ans = poly_y(1)
for i in degs:
temp = poly_y(i)
ans = np.vstack([ans, temp])
You set ans to the result for degree 1, then loop through all degrees and stack each result onto ans. But the degrees include 1, so you get degree 1 twice and end up with a 500-by-1 array. You can therefore remove the first line. That leaves a loop which repeatedly stacks onto ans; this can be done in one go with a list comprehension (e.g., [poly_y(deg) for deg in degs]). Vertically stacking that gives a 400-by-1 array, which is still not what you want. You could reshape it, or you could use hstack instead: that returns a 100-by-4 array, and to get a 4-by-100 array you just transpose it.
So the final solution would be to replace the above four lines with
ans = np.hstack([poly_y(deg) for deg in degs]).T
(and if you want to get more fancy, replace those lines and the return ans line with
return np.hstack([poly_y(deg) for deg in degs]).T
)
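Putting those pieces together, a minimal sketch of the corrected function (assuming the same imports, data, and train/test split as in the question) could look like this:
def model():
    degs = (1, 3, 7, 11)

    def poly_y(i):
        # Fit a degree-i polynomial regression on the training data
        poly = PolynomialFeatures(degree=i)
        x_poly = poly.fit_transform(X_train.reshape(-1, 1))
        linreg = LinearRegression().fit(x_poly, y_train)
        # Predict over 100 evenly spaced points and return a (100, 1) column
        return linreg.predict(poly.fit_transform(np.linspace(0, 20, 100).reshape(-1, 1))).reshape(-1, 1)

    # Four (100, 1) columns stacked side by side give (100, 4); transposing gives (4, 100)
    return np.hstack([poly_y(deg) for deg in degs]).T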

Related

How to do Constrained Linear Regression - scikit learn?

I am trying to carry out linear regression subject to some constraints to get a certain prediction.
I want the model to predict the first half of the data linearly, and the second half to stay within a very narrow range around the last value of the first half (using constraints), similar to the green line in the figure.
The full code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
pd.options.mode.chained_assignment = None # default='warn'
data = [5.269, 5.346, 5.375, 5.482, 5.519, 5.57, 5.593999999999999, 5.627000000000001, 5.724, 5.818, 5.792999999999999, 5.817, 5.8389999999999995, 5.882000000000001, 5.92, 6.025, 6.064, 6.111000000000001, 6.1160000000000005, 6.138, 6.247000000000001, 6.279, 6.332000000000001, 6.3389999999999995, 6.3420000000000005, 6.412999999999999, 6.442, 6.519, 6.596, 6.603, 6.627999999999999, 6.76, 6.837000000000001, 6.781000000000001, 6.8260000000000005, 6.849, 6.875, 6.982, 7.018, 7.042000000000001, 7.068, 7.091, 7.204, 7.228, 7.261, 7.3420000000000005, 7.414, 7.44, 7.516, 7.542000000000001, 7.627000000000001, 7.667000000000001, 7.821000000000001, 7.792999999999999, 7.756, 7.871, 8.006, 8.078, 7.916, 7.974, 8.074, 8.119, 8.228, 7.976, 8.045, 8.312999999999999, 8.335, 8.388, 8.437999999999999, 8.456, 8.227, 8.266, 8.277999999999999, 8.289, 8.299, 8.318, 8.332, 8.34, 8.349, 8.36, 8.363999999999999, 8.368, 8.282, 8.283999999999999]
time = range(1,85,1)
x=int(0.7*len(data))
df = pd.DataFrame(list(zip(*[time, data])))
df.columns = ['time', 'data']
# print df
x=int(0.7*len(df))
train = df[:x]
valid = df[x:]
models = []
names = []
tr_x_ax = []
va_x_ax = []
pr_x_ax = []
tr_y_ax = []
va_y_ax = []
pr_y_ax = []
time_model = []
models.append(('LR', LinearRegression()))
for name, model in models:
    x_train = df.iloc[:, 0][:x].values
    y_train = df.iloc[:, 1][:x].values
    x_valid = df.iloc[:, 0][x:].values
    y_valid = df.iloc[:, 1][x:].values
    model = LinearRegression()
    # poly = PolynomialFeatures(5)
    x_train = x_train.reshape(-1, 1)
    y_train = y_train.reshape(-1, 1)
    x_valid = x_valid.reshape(-1, 1)
    y_valid = y_valid.reshape(-1, 1)
    # model.fit(x_train, y_train)
    model.fit(x_train, y_train.ravel())
    # score = model.score(x_train, y_train.ravel())
    # print 'score', score
    preds = model.predict(x_valid)
    tr_x_ax.extend(train['data'])
    va_x_ax.extend(valid['data'])
    pr_x_ax.extend(preds)
    valid['Predictions'] = preds
    valid.index = df[x:].index
    train.index = df[:x].index
    plt.figure(figsize=(5, 5))
    # plt.plot(train['data'], label='data')
    # plt.plot(valid[['Close', 'Predictions']])
    x = valid['data']
    # print x
    # plt.plot(valid['data'], label='validation')
    plt.plot(valid['Predictions'], label='Predictions before', color='orange')
    y = range(0, 58)
    y1 = range(58, 84)
    for index, item in enumerate(pr_x_ax):
        if index > 13:
            pr_x_ax[index] = pr_x_ax[13]
    pr_x_ax = list([float(i) for i in pr_x_ax])
    va_x_ax = list([float(i) for i in va_x_ax])
    tr_x_ax = list([float(i) for i in tr_x_ax])
    plt.plot(y, tr_x_ax, label='train', color='red', linewidth=2)
    plt.plot(y1, va_x_ax, label='validation1', color='blue', linewidth=2)
    plt.plot(y1, pr_x_ax, label='Predictions after', color='green', linewidth=2)
    plt.xlabel("time")
    plt.ylabel("data")
    plt.xticks(rotation=45)
    plt.legend()
    plt.show()
If you look at this figure:
label 'Predictions before': the model predicted this without any constraints (I don't need this result).
label 'Predictions after': the model predicted this within a constraint, but only after the prediction was made, and all values from index 71 onward are set equal to the last value there (item 8.56).
I used the loop for index, item in enumerate(pr_x_ax): (near the end of the code above) to do this, so the curve is a straight line from time 71 to 85 s, as you can see, in order to show how I need the model to behave.
Could I build a model that gives the same result instead of using the for loop?
I'd appreciate your suggestions.
I assume that by drawing the green line in your question you expect the trained model to predict a linear, horizontal turn to the right. But the currently trained model draws just the straight orange line.
For any trained model of any algorithm and type, it holds that in order to learn some unusual change in behavior, the model needs at least some samples of that unusual change, or at least some hidden signal in the observed data should point to such a change.
In other words, for your model to learn the right turn of the green line, the model should have points with that right turn in the training data set. But you take just the first (leftmost) 70% of the data for training, with train = df[:int(0.7 * len(df))], and that training data has no such right turn; it just looks close to one straight line.
So you need to re-sample your data into training and validation sets in a different way: take 70% of the samples randomly from the whole range of X and let the rest go to validation, so that samples containing the right turn are also included in your training data.
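For example, a minimal sketch of such a random split using sklearn's train_test_split with shuffling (the full code further below uses a random mask instead, which achieves the same thing):
from sklearn.model_selection import train_test_split

# Shuffled 70/30 split so the training samples cover the whole time range,
# including the flat region on the right
train, valid = train_test_split(df, train_size=0.7, shuffle=True, random_state=0)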
The second thing is that a LinearRegression model always models its predictions with one single straight line, and that line can't have right turns. In order to have right turns you need a more complex model.
One way for a model to have a right turn is to be piecewise-linear, i.e. to consist of several joined straight line segments. I didn't find a ready-made piecewise-linear model inside sklearn, only in other pip packages, so I decided to implement my own simple class PieceWiseLinearRegression, which uses np.piecewise() and scipy.optimize.curve_fit() to model a piecewise-linear function.
The next picture shows the result of applying the two points above (re-sampling the dataset differently and modeling a piecewise-linear function); the code follows afterwards. Your current linear model LR still predicts just one straight blue line, while my piecewise-linear PWLR2 (orange line) consists of two segments and correctly predicts the right turn:
To show the PWLR2 graph alone more clearly, I made a second picture as well:
My class PieceWiseLinearRegression takes just one constructor argument, n, the number of linear segments to use for the prediction. For the pictures above, n = 2 was used.
import sys, numpy as np, pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

np.random.seed(0)

class PieceWiseLinearRegression:
    @classmethod
    def nargs_func(cls, f, n):
        return eval('lambda ' + ', '.join([f'a{i}' for i in range(n)]) + ': f(' + ', '.join([f'a{i}' for i in range(n)]) + ')', locals())

    @classmethod
    def piecewise_linear(cls, n):
        condlist = lambda xs, xa: [(lambda x: (
            (xs[i] <= x if i > 0 else np.full_like(x, True, dtype = np.bool_)) &
            (x < xs[i + 1] if i < n - 1 else np.full_like(x, True, dtype = np.bool_))
        ))(xa) for i in range(n)]
        funclist = lambda xs, ys: [(lambda i: (
            lambda x: (
                (x - xs[i]) * (ys[i + 1] - ys[i]) / (
                    (xs[i + 1] - xs[i]) if abs(xs[i + 1] - xs[i]) > 10 ** -7 else 10 ** -7 * (-1, 1)[xs[i + 1] - xs[i] >= 0]
                ) + ys[i]
            )
        ))(j) for j in range(n)]
        def f(x, *pargs):
            assert len(pargs) == (n + 1) * 2, (n, pargs)
            xs, ys = pargs[0::2], pargs[1::2]
            xa = x.ravel().astype(np.float64)
            ya = np.piecewise(x = xa, condlist = condlist(xs, xa), funclist = funclist(xs, ys)).ravel()
            #print('xs', xs, 'ys', ys, 'xa', xa, 'ya', ya)
            return ya
        return cls.nargs_func(f, 1 + (n + 1) * 2)

    def __init__(self, n):
        self.n = n
        self.f = self.piecewise_linear(self.n)

    def fit(self, x, y):
        from scipy import optimize
        self.p, self.e = optimize.curve_fit(self.f, x, y, p0 = [j for i in range(self.n + 1) for j in (np.amin(x) + i * (np.amax(x) - np.amin(x)) / self.n, 1)])
        #print('p', self.p)

    def predict(self, x):
        return self.f(x, *self.p)
data = [5.269, 5.346, 5.375, 5.482, 5.519, 5.57, 5.593999999999999, 5.627000000000001, 5.724, 5.818, 5.792999999999999, 5.817, 5.8389999999999995, 5.882000000000001, 5.92, 6.025, 6.064, 6.111000000000001, 6.1160000000000005, 6.138, 6.247000000000001, 6.279, 6.332000000000001, 6.3389999999999995, 6.3420000000000005, 6.412999999999999, 6.442, 6.519, 6.596, 6.603, 6.627999999999999, 6.76, 6.837000000000001, 6.781000000000001, 6.8260000000000005, 6.849, 6.875, 6.982, 7.018, 7.042000000000001, 7.068, 7.091, 7.204, 7.228, 7.261, 7.3420000000000005, 7.414, 7.44, 7.516, 7.542000000000001, 7.627000000000001, 7.667000000000001, 7.821000000000001, 7.792999999999999, 7.756, 7.871, 8.006, 8.078, 7.916, 7.974, 8.074, 8.119, 8.228, 7.976, 8.045, 8.312999999999999, 8.335, 8.388, 8.437999999999999, 8.456, 8.227, 8.266, 8.277999999999999, 8.289, 8.299, 8.318, 8.332, 8.34, 8.349, 8.36, 8.363999999999999, 8.368, 8.282, 8.283999999999999]
time = list(range(1, 85))
df = pd.DataFrame(list(zip(time, data)), columns = ['time', 'data'])
choose_train = np.random.uniform(size = (len(df),)) < 0.8
choose_valid = ~choose_train
x_all = df.iloc[:, 0].values
y_all = df.iloc[:, 1].values
x_train = df.iloc[:, 0][choose_train].values
y_train = df.iloc[:, 1][choose_train].values
x_valid = df.iloc[:, 0][choose_valid].values
y_valid = df.iloc[:, 1][choose_valid].values
x_all_lin = np.linspace(np.amin(x_all), np.amax(x_all), 500)
models = []
models.append(('LR', LinearRegression()))
models.append(('PWLR2', PieceWiseLinearRegression(2)))
for imodel, (name, model) in enumerate(models):
    model.fit(x_train[:, None], y_train)
    x_all_lin_pred = model.predict(x_all_lin[:, None])
    plt.plot(x_all_lin, x_all_lin_pred, label = f'pred {name}')
plt.plot(x_train, y_train, label='train')
plt.plot(x_valid, y_valid, label='valid')
plt.xlabel('time')
plt.ylabel('data')
plt.legend()
plt.show()

Can't figure the issue with these simple lines of code for Linear Regression

I have some issues with linear regression. I just used a simple sample and I still get an error; I don't know what I'm doing wrong.
Here's the code:
import numpy as np
from sklearn.linear_model import LinearRegression

x = [1,1,2,3,1,1,2,0,4,1]
x = np.array(x)
x = np.reshape(1,-1)
y = [1.24,0.88,0.88,1.31,1.36,0.79,0.79,0.79,1.36,1.36]
y = np.array(y)
y = np.reshape(1,-1)
lin_reg = LinearRegression()
lin_reg.fit(x,y)
"ValueError: Expected 2D array, got 1D array instead:
array=[1].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample."
The error says what you should do in this case.
Just use .reshape(-1, 1) instead of .reshape(1,-1).
Do it only for x and the problem is solved.
x = [1,1,2,3,1,1,2,0,4,1]
x = np.array(x).reshape(-1, 1) # Edited line
y = [1.24,0.88,0.88,1.31,1.36,0.79,0.79,0.79,1.36,1.36]
lin_reg = LinearRegression()
lin_reg.fit(x,y)
You called the reshape function incorrectly.
If you want to reshape the x or y matrices, you should call it as a method on the array:
x = x.reshape(1, -1)
or, combined with the line before:
x = np.array(x).reshape(1, -1)
If you call np.reshape(1, -1) as in your code, you are reshaping the literal value 1 rather than your data, so x ends up as array([1]).
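To see the difference, a quick sketch (using reshape(-1, 1) as in the accepted fix above):
import numpy as np

x = [1, 1, 2, 3, 1, 1, 2, 0, 4, 1]

wrong = np.reshape(1, -1)           # reshapes the scalar 1 -> array([1]); the list x is untouched
right = np.array(x).reshape(-1, 1)  # 10 samples, 1 feature -> shape (10, 1), what fit() expects

print(wrong.shape, right.shape)     # (1,) (10, 1)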

Is there any kind of package that includes a function to smooth out, or even out, the distribution of an array by deleting samples?

I have an array of values ranging from 0 to 1 that represent the output truth values for a neural network I'm building. However, the distribution is very wide and uneven, so I was curious whether there is a Python package that could remove samples so that the distribution becomes more even across the array.
Here's the distribution plot from seaborn's seaborn.distplot().
What I'd like to do is essentially specify a value of how many 'sections' to break the array into, and to remove values from the largest sections so that the distribution is more even.
The plot from the output of this function would probably look something like this:
Does there exist any kind of built-in package for numpy, or scipy to do this?
If this can help anyone in the future, here's what I came up with:
import random
import numpy as np

def reject_outliers(x_t, y_t, m):
    mean = np.mean(y_t)
    std = np.std(y_t)
    x_t, y_t = zip(*[[x, y] for x, y in zip(x_t, y_t) if abs(y - mean) < (m * std)])
    return list(x_t), np.array(y_t)

def even_out_distribution(x_t, y_t, n_sections, reduction=0.5, reduce_min=.5, m=2):
    x_t, y_t = reject_outliers(x_t, y_t, m)
    linspace = np.linspace(np.min(y_t), np.max(y_t), n_sections + 1)
    sections = [[] for i in range(n_sections)]
    for x, y in zip(x_t, y_t):
        where = max(np.searchsorted(linspace, y) - 1, 0)
        sections[where].append([x, y])
    sections = [sec for sec in sections if sec != []]
    min_section = min([len(i) for i in sections])  # np.mean([len(i) for i in sections]) * reduce_min  # todo: in place of min([len(i) for i in sections])
    print([len(i) for i in sections])
    new_sections = []
    for section in sections:
        this_section = list(section)
        if len(section) > min_section:
            to_remove = (len(section) - min_section) * reduction
            for i in range(int(to_remove)):
                this_section.pop(random.randrange(len(this_section)))
        new_sections.append(this_section)
    print([len(i) for i in new_sections])
    output = [inner for outer in new_sections for inner in outer]
    x_t, y_t = zip(*output)
    return list(x_t), np.array(y_t)
I made it so that if it deletes a sample from y_train, it also deletes the corresponding sample from your input list, x_train.
Usage: x, y = even_out_distribution(x_train, y_train, n_sections=10, m=2, reduction=.8)
and now my distribution plot looks a lot nicer!
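As a quick illustration, a hypothetical demo on synthetic data (x_demo and y_demo are made-up names for this example):
import numpy as np

# Skewed synthetic targets in [0, 1] and dummy inputs of the same length
y_demo = np.random.beta(2.0, 8.0, size=2000)
x_demo = list(range(len(y_demo)))

x_even, y_even = even_out_distribution(x_demo, y_demo, n_sections=10, m=2, reduction=.8)
print(len(x_demo), '->', len(x_even))  # fewer samples, flatter distribution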

Python / How to delete specific rows in testing data with indices after train/test/split

I want to delete from X_test and y_test every row where MFD is bigger than one. The problem is that I always get randomly shuffled indices back from train_test_split. If I try to drop the rows I get the following error message:
IndexError: index 3779 is out of bounds for axis 1 with size 3488
I can't use the old indices to drop them, but how can I get the new ones where MFD > 1?
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=test_size,
                                                    random_state=random_state,
                                                    stratify=y)
mfd_drop_rows = []
i_nr = 0
for i in X_test.MFD:
    if i > 1:
        mfd_drop_rows.append(X_test.index[i_nr])
    i_nr += 1
X_test_new = X_test.drop(X_test.index[mfd_drop_rows])
y_test_new = y_test.drop(y_test.index[mfd_drop_rows])
Thanks for your help ( =
Not sure what MFD is, but assuming that X_test.MFD gives you an array of numbers, you could use a boolean mask to drop rows. A simple example of how to use a mask can be seen here:
import numpy as np

x = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])
mfd = np.array([0.6, 1.3])
mask = mfd > 1
x_new = x[mask, :]
This would give:
x = [1,2,3,4,5
6,7,8,9,10]
mask = [False, True]
x_new = [6,7,8,9,10]
I've solved it, sorry; I just use my i_nr iteration counter and get the new index.
Thanks to everyone who read it.
mfd_drop_rows = []
i_nr = 0
for i in X_test.MFD:
    if i > 1:
        mfd_drop_rows.append(i_nr)
    i_nr += 1
X_test_new = X_test.drop(X_test.index[mfd_drop_rows])
y_test_new = y_test.drop(y_test.index[mfd_drop_rows])
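For reference, the same filtering can also be done without the loop, using a boolean mask; this assumes X_test and y_test are pandas objects that still share the same index after train_test_split:
# Keep only rows where MFD <= 1; the index-aligned mask drops the same rows from y_test
keep = X_test['MFD'] <= 1
X_test_new = X_test[keep]
y_test_new = y_test[keep]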

Scikit-learn SVM: Reshaping X leads to incompatible shapes

I am trying to use a scikit-learn SVM to predict whether a stock from the S&P 500 beats the index or not.
I have the 'sample' file, from which I extract the features X and the labels Y (beats the index or doesn't beat it).
When I tried it the first time (without reshaping X) I got the following deprecation warning:
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17
and will raise ValueError in 0.19. Reshape your data either using
X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1)
if it contains a single sample.
Consequently I tried reshaping X according to the recommendation and also to some forum posts.
Now, however, I get the following ValueError saying that X and Y don't have compatible shapes.
ValueError: X and y have incompatible shapes.
X has 4337 samples, but y has 393.
Below you can see the shapes of X and Y before reshaping:
('Shape of X = ', (493, 9))
('Shape of Y = ', (493,))
and after reshaping:
('Shape of X = ', (4437, 1))
('Shape of Y = ', (493,))
I also tried to reshape so that I get the (493, 9) shape back, but this didn't work either, as I got the following error:
ValueError: total size of new array must be unchanged.
Below I have posted the code to extract the features and labels from the pandas DataFrame, and the SVM analysis:
Feature & Label selection:
import numpy as np
from sklearn import preprocessing, svm

X = np.array(sample[features].values)
X = preprocessing.scale(X)
X = np.array(X)
X = X.reshape(-1,1)
Y = sample['status'].values.tolist()
Y = np.array(Y)
Z = np.array(sample[['changemktvalue', 'benchmark']])
SVM testing:
test_size = 50
invest_amount = 1000
total_invests = 0
if_market = 0
if_strat = 0

clf = svm.SVC(kernel="linear", C=1.0)
clf.fit(X[:-test_size], Y[:-test_size])

correct_count = 0
for x in range(1, test_size + 1):
    if clf.predict(X[-x])[0] == Y[-x]:
        correct_count += 1
    if clf.predict(X[-x])[0] == 1:
        invest_return = invest_amount + (invest_amount * (Z[-x][0] / 100))  # zeroth element of z
        market_return = invest_amount + (invest_amount * (Z[-x][1] / 100))  # marketsp500 is at pos 1
        total_invests += 1
        if_market += market_return
        if_strat += invest_return

print("Accuracy:", (float(correct_count) / test_size) * 100.00)
Would be great if you have any inputs on how to solve this.
You should not be reshaping X to (-1, 1). In fact the error is in your call to the predict method.
Change
clf.predict(X[-x])[0]
to
clf.predict(X[-x].reshape((-1, 9)))[0]
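Putting both points together, a minimal sketch using the question's variable names (keep X as an (n_samples, 9) matrix and reshape only the single row passed to predict inside the loop; reshape(1, -1) is equivalent to reshape((-1, 9)) here):
# Keep the feature matrix 2-D; do not flatten it with X.reshape(-1, 1)
X = preprocessing.scale(np.array(sample[features].values))   # shape (493, 9)

clf = svm.SVC(kernel="linear", C=1.0)
clf.fit(X[:-test_size], Y[:-test_size])

# Inside the loop: X[-x] is a 1-D vector of 9 features; predict() wants a 2-D (1, 9) array
prediction = clf.predict(X[-x].reshape(1, -1))[0]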
