I am trying to use a scikit-learn SVM to predict whether a stock from the S&P 500 beats the index or not.
I have a 'sample' file from which I extract the features X and the labels Y (beats the index or doesn't).
When I first tried it (without reshaping X), I got the following deprecation warning:
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17
and will raise ValueError in 0.19. Reshape your data either using
X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1)
if it contains a single sample.
So I reshaped X as the warning recommends and as some forum posts suggest.
Now, however, I get the following ValueError saying that X and Y have incompatible shapes:
ValueError: X and y have incompatible shapes.
X has 4337 samples, but y has 393.
Below you can see the shapes of X and Y before reshaping:
('Shape of X = ', (493, 9))
('Shape of Y = ', (493,))
and after reshaping:
('Shape of X = ', (4437, 1))
('Shape of Y = ', (493,))
I also tried to reshape X back to the (493, 9) shape, but that didn't work either; I got the following error:
ValueError: total size of new array must be unchanged.
Below is the code that extracts the features and labels from the pandas DataFrame, followed by the SVM analysis:
Feature & Label selection:
X = np.array(sample[features].values)
X = preprocessing.scale(X)
X = np.array(X)
X = X.reshape(-1,1)
Y = sample['status'].values.tolist()
Y = np.array(Y)
Z = np.array(sample[['changemktvalue', 'benchmark']])
SVM testing:
test_size = 50
invest_amount = 1000
total_invests = 0
if_market = 0
if_strat = 0
clf = svm.SVC(kernel="linear", C= 1.0)
clf.fit(X[:-test_size],Y[:-test_size])
correct_count = 0
for x in range(1, test_size+1):
    if clf.predict(X[-x])[0] == Y[-x]:
        correct_count += 1
    if clf.predict(X[-x])[0] == 1:
        invest_return = invest_amount + (invest_amount * (Z[-x][0]/100)) #zeroth element of z
        market_return = invest_amount + (invest_amount * (Z[-x][1]/100)) #marketsp500 is at pos 1
        total_invests += 1
        if_market += market_return
        if_strat += invest_return
print("Accuracy:", (float(correct_count)/test_size) * 100.00)
Any input on how to solve this would be appreciated.
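You should not reshape X to (-1, 1): X has 9 features per sample, so that reshape turns the 493 samples into 4437 one-feature samples. The actual error is in your call to the predict method, which receives a single 1-D row.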
Change
clf.predict(X[-x])[0]
to
clf.predict(X[-x].reshape((-1, 9)))[0]
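As a rough illustration of the shapes involved, here is a NumPy-only sketch. It uses random stand-in data (an assumption, not the asker's `sample` file) with the same (493, 9) shape, and shows what the two reshapes do:

```python
import numpy as np

# Stand-in data with the same shape as in the question: 493 samples, 9 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(493, 9))

# Reshaping the whole matrix to (-1, 1) turns every feature value into its own "sample".
flattened = X.reshape(-1, 1)

# A single row is 1-D; predict() expects a 2-D array of shape (n_samples, n_features).
row = X[-1]
row_2d = row.reshape(1, -1)  # same result as row.reshape((-1, 9)) for a 9-feature row

print(flattened.shape)  # (4437, 1)
print(row.shape)        # (9,)
print(row_2d.shape)     # (1, 9)
```

With X left at (493, 9), passing the reshaped row to clf.predict gives the estimator one sample with nine features.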
I want to use a hurdle model from statsmodels (https://www.statsmodels.org/dev/examples/notebooks/generated/count_hurdle.html).
Unfortunately, when following the manual, I am running into several problems:
I can only import discrete_model and count_model from statsmodels.discrete, so the following imports work:
import statsmodels.discrete.discrete_model
import statsmodels.discrete.count_model
However, there is no module named 'statsmodels.discrete.truncated_model'.
Furthermore, I am not able to run get_diagnostic() on a Poisson regression. I am running the same code as on the statsmodels website (see below), but it throws the following error:
'PoissonResults' object has no attribute 'get_diagnostic'
Thanks for your help!
np.random.seed(987456348)
# large sample to get strong results
nobs = 5000
x = np.column_stack((np.ones(nobs), np.linspace(0, 1, nobs)))
mu0 = np.exp(0.5 *2 * x.sum(1))
y = np.random.poisson(mu0, size=nobs)
print(np.bincount(y))
y_ = y
indices = np.arange(len(y))
mask = mask0 = y > 0
for _ in range(10):
    print( mask.sum())
    indices = mask #indices[mask]
    if not np.any(mask):
        break
    mu_ = np.exp(0.5 * x[indices].sum(1))
    y[indices] = y_ = np.random.poisson(mu_, size=len(mu_))
    np.place(y, mask, y_)
    mask = np.logical_and(mask0, y == 0)
np.bincount(y)
mod_p = Poisson(y, x)
res_p = mod_p.fit()
print(res_p.summary())
dia_p = res_p.get_diagnostic()
dia_p.plot_probs();
This is my code, but when I run it I do not get the correct shape. I need it to return a NumPy array of shape (4, 100).
For context: I fit a polynomial LinearRegression model on the training data for each of the specified degrees, then generate predictions over 100 evenly spaced x values and transpose each 100-row, single-column output into a single-row, 100-column array.
np.random.seed(0)
C = 15
n = 60
x = np.linspace(0, 20, n) # x is drawn from a fixed range
y = x ** 3 / 20 - x ** 2 - x + C * np.random.randn(n)
x = x.reshape(-1, 1) # convert x and y from simple array to a 1-column matrix for input to sklearn regression
y = y.reshape(-1, 1)
# Create the training and testing sets and their targets
X_train, X_test, y_train, y_test = train_test_split(x, y, random_state=0)
def model():
    degs = (1, 3, 7, 11)
    #Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it
    #contains a single sample.
    def poly_y(i):
        poly = PolynomialFeatures(degree = i)
        x_poly = poly.fit_transform(X_train.reshape(-1,1))
        linreg = LinearRegression().fit(x_poly, y_train)
        #x_orig = np.linspace(0, 20, 100)
        y_pred = linreg.predict(poly.fit_transform(np.linspace(0, 20, 100).reshape(-1,1)))
        y_pred = y_pred.T
        return(y_pred.reshape(-1,1))
    ans = poly_y(1)
    for i in degs:
        temp = poly_y(i)
        ans = np.vstack([ans, temp])
    return ans
model()
Combining the comments to your question, with a brief explanation:
You're currently doing
ans = poly_y(1)
for i in degs:
    temp = poly_y(i)
    ans = np.vstack([ans, temp])
You set ans to the result for a degree of one, then loop through all degrees and stack those to ans. But, all degrees include 1, so you get degree 1 twice, and end up with a 500 by 1 array. Thus, you can remove the first line. Then, you have this loop where you repeatedly stack to ans, which can be done in one go, using a list comprehension (e.g., with [poly_y(deg) for deg in degs]). Stacking that results in a 400 by 1 array, which is not what you want. You could reshape that, or you could use hstack. The latter returns a 100 by 4 array; to get a 4 by 100 array, just transpose that.
So the final solution would be to replace the above four lines with
ans = np.hstack([poly_y(deg) for deg in degs]).T
(and if you want to get more fancy, replace those lines and the return ans line with
return np.hstack([poly_y(deg) for deg in degs]).T
)
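The shape arithmetic above can be checked with a small NumPy-only sketch. The constant-filled columns are hypothetical stand-ins for the four poly_y(deg) results, each assumed to have shape (100, 1) as in the question:

```python
import numpy as np

# Hypothetical stand-ins for the four poly_y(deg) results: each is a (100, 1) column.
degs = (1, 3, 7, 11)
columns = [np.full((100, 1), float(deg)) for deg in degs]

stacked_v = np.vstack(columns)  # rows appended underneath each other
stacked_h = np.hstack(columns)  # columns placed side by side
result = stacked_h.T            # one row per degree

print(stacked_v.shape)  # (400, 1)
print(stacked_h.shape)  # (100, 4)
print(result.shape)     # (4, 100)
```

Each row of result then holds the 100 predictions for one degree, in the order given by degs.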
I have two numpy arrays
import numpy as np
temp_1 = np.array([['19.78018766'],
                   ['19.72487359'],
                   ['19.70280336'],
                   ['19.69589641'],
                   ['19.69746018']])
temp_2 = np.array([['43.8'],
                   ['43.9'],
                   ['44'],
                   ['44.1'],
                   ['44.2']])
and I am preparing X = np.stack((temp_1,temp_2), axis=-1)
which looks something like this
X = [[['19.78018766' '43.8']]
[['19.72487359' '43.9']]
[['19.70280336' '44']]
[['19.69589641' '44.1']]
[['19.69746018' '44.2']]]
I have another variable Y which is also a numpy array
Y = np.array([['28.78'],
['32.72'],
['15.70'],
['32.69'],
['55.69']])
I am trying to fit a RandomForestRegressor model:
from sklearn.ensemble import RandomForestRegressor
clf = RandomForestRegressor()
clf.fit(X,Y)
However, it is giving me this error
ValueError: Found array with dim 3. Estimator expected <= 2.
This happens because X is three-dimensional, with shape (5, 1, 2), while scikit-learn estimators expect a 2-D X of shape (n_samples, n_features).
Just reshape X so that it has one row per sample and one column per feature:
# In this example 5 samples
X = X.reshape(5, 2)
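A minimal NumPy-only sketch of where the extra dimension comes from, using the first three rows of the question's data. Building X with np.hstack and casting the strings to float are my own suggestions here, not part of the answer above:

```python
import numpy as np

temp_1 = np.array([['19.78018766'], ['19.72487359'], ['19.70280336']])
temp_2 = np.array([['43.8'], ['43.9'], ['44']])

# np.stack adds a new trailing axis, so two (3, 1) arrays become one (3, 1, 2) array.
stacked = np.stack((temp_1, temp_2), axis=-1)

# np.hstack joins along the existing column axis and stays 2-D;
# astype(float) converts the numeric strings to numbers.
X = np.hstack((temp_1, temp_2)).astype(float)

# Reshaping the 3-D array to (n_samples, -1) gives the same rows.
X_from_reshape = stacked.reshape(len(stacked), -1).astype(float)

print(stacked.shape)         # (3, 1, 2)
print(X.shape)               # (3, 2)
print(X_from_reshape.shape)  # (3, 2)
```

Passing -1 for the second dimension avoids hard-coding the number of features.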
I am having trouble with LinearRegression. Even with a simple sample I still get an error, and I don't know what I'm doing wrong.
Here's the code:
x = [1,1,2,3,1,1,2,0,4,1]
x = np.array(x)
x = np.reshape(1,-1)
y = [1.24,0.88,0.88,1.31,1.36,0.79,0.79,0.79,1.36,1.36]
y = np.array(y)
y = np.reshape(1,-1)
lin_reg = LinearRegression()
lin_reg.fit(x,y)
"ValueError: Expected 2D array, got 1D array instead:
array=[1].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample."
The error says what you should do in this case.
Just use .reshape(-1, 1) instead of .reshape(1,-1).
Do it only for x and the problem is solved.
x = [1,1,2,3,1,1,2,0,4,1]
x = np.array(x).reshape(-1, 1) # Edited line
y = [1.24,0.88,0.88,1.31,1.36,0.79,0.79,0.79,1.36,1.36]
lin_reg = LinearRegression()
lin_reg.fit(x,y)
You called the reshape function incorrectly.
To reshape the x or y arrays, call reshape as a method of the array:
x = x.reshape(1, -1)
or, combined with the line before:
x = np.array(x).reshape(1, -1)
Calling np.reshape(1, -1) does nothing to your data: it reshapes the scalar 1 and returns array([1]), which is the array shown in the error message.
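The difference between the two calls can be seen in a short NumPy-only sketch, using the x values from the question and the (-1, 1) shape that the error message recommends for a single feature:

```python
import numpy as np

x = np.array([1, 1, 2, 3, 1, 1, 2, 0, 4, 1])

# np.reshape(1, -1) reshapes the scalar 1, not x, so the original data is discarded.
wrong = np.reshape(1, -1)

# Either form below reshapes x itself into one column (10 samples, 1 feature).
as_method = x.reshape(-1, 1)
as_function = np.reshape(x, (-1, 1))

print(wrong)              # [1]
print(as_method.shape)    # (10, 1)
print(as_function.shape)  # (10, 1)
```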
I found this code snippet for oversampling my negative reviews so that my model trains better. When I run it, I get the error below. It seems to be related to idx. Does anyone have a good solution for this?
Passing list-likes to .loc or [] with any missing labels is no longer supported
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike
from sklearn.utils import shuffle
import numpy as np
labels, num = np.unique(y_train, return_counts=True)
#print(labels)
u = min(labels)
initial = 1
# set the desired size of the oversampled cells
maxcnt = int(max(num) / 2)
for labl, n in zip(labels, num):
    x0 = X_train[y_train == labl]
    y0 = y_train[y_train == labl]
    # print(x0)
    remain = maxcnt
    print(remain)
    while remain >= n:
        if labl == u and initial == 1:
            X_Train = x0
            y_Train = y0
            remain -= n
            initial = 0
        else:
            X_Train = np.concatenate((X_Train, x0), axis=0)
            y_Train = np.concatenate((y_Train, y0), axis=0)
            remain -= n
    if remain > 0 and remain < n:
        idx = np.random.choice(np.arange(len(y0)), remain, replace=False)
        #print(idx)
        X_Train = np.concatenate((X_Train, x0[idx]), axis=0)
        y_Train = np.concatenate((y_Train, y0[idx]), axis=0)
        remain -= n
X_Train, y_Train = shuffle(X_Train, y_Train)
np.unique(X_Train, return_counts=True)
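One possible cause, offered as an assumption because the question does not show how X_train and y_train are built: if they are pandas objects, the boolean filter keeps the original index labels, so x0[idx] and y0[idx] treat the positions in idx as labels and fail when some of them are missing. A minimal sketch with made-up data that selects by position instead:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for a filtered Series: the index labels no longer start at 0.
y0 = pd.Series([10, 20, 30, 40], index=[5, 9, 12, 20])

# Positions drawn the same way as in the snippet above.
rng = np.random.default_rng(0)
idx = rng.choice(np.arange(len(y0)), 2, replace=False)

# .iloc selects by position; converting to a NumPy array first works as well.
picked = y0.iloc[idx]
picked_np = y0.to_numpy()[idx]

print(picked.tolist(), picked_np.tolist())
```

If this is the cause, replacing x0[idx] and y0[idx] with x0.iloc[idx] and y0.iloc[idx], or converting X_train and y_train to NumPy arrays up front, should avoid the label lookup.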