I have code that works with knn.predict() when the data has a single feature to predict the next outcome. For context: I have stock data (Open, High, Low, Close) where I use "Open" as the X data and "Close" as the Y data, and knn.predict predicts the next value of Y.
When I try to use "Open, High, Low" columns (3 features) for my X data, I get the following error:
File "sklearn\neighbors\binary_tree.pxi", line 1294, in sklearn.neighbors.kd_tree.BinaryTree.query
ValueError: query data dimension must match training data dimension
I believe it's because of my X.shape and Y.shape, where X is not the same size as Y, but I don't understand how to fix it. How do you use KNN for multi-feature analysis if X and Y must be the same size?
Some of the Code:
df = df[['Date','Time', 'Open', 'High', 'Low', 'Close']]
df.head()
# Predictor Variables
df['Open'] = df.Open
df['High'] = df.High
df['Low'] = df.Low
df['Close'] = df.Close
df = df.dropna()
#Data = np.delete(arr = df, obj=0, axis = 0)
X = np.array(df.ix[:, 2:6])
#X.head()
print X.shape
# Target Variable
Y = np.where(df['Close'].shift(-1)>df['Close'],1,-1)
#print (Y)
#Predict
u = df['Close'].iloc[-1]
#print u
new_prediction = knn.predict(u)
print new_prediction
For training, you're using
X = np.array(df.ix[:, 2:6])
i.e., a matrix with 6 - 2 = 4 columns, meaning that the neighbors are 4-tuples.
For predicting, you're using
u = df['Close'].iloc[-1]
which is a scalar.
The nearest neighbor is undefined, and sklearn is very unhappy.
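To get a prediction from the trained model, pass it a row with the same four features. A minimal sketch, assuming knn was fitted on the 4-column X above; the reshape to a (1, 4) array is the key step:
u = X[-1].reshape(1, -1)  # last row of the training data as a (1, 4) array, not a scalar
new_prediction = knn.predict(u)
print(new_prediction)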
I'm trying to convert the dataset into a classification dataset by:
Step 1: Split the range of target values into three equal parts - low, mid, and high.
Step 2: Reassign the target values to three categorical values 0, 1, and 2, representing the low, mid, and high range of values, respectively.
I tried a different approach using the method suggested in this post: How to automatically categorise data in panda dataframe? but didn't get the result I wanted. Any suggestions?
Dataset in question:
from sklearn.datasets import load_boston
data = load_boston()
X = data.data
y = data.target
The idea: find the lowest remaining value of y and overwrite it with a value (100) higher than max(y) (50 in your example). Repeat until this has been done for 33% of your y, then do the same for the next two thirds with two other values higher than max(y) (200 and 300).
Then we use a function to map the 100, 200 and 300 to 0, 1 and 2:
from sklearn.datasets import load_boston

data = load_boston()
X = data.data
y = data.target

y = list(y)
print(y)

# relabel the lowest third as 100, the middle third as 200, the top third as 300;
# min(y) always finds a not-yet-relabeled value because 100+ exceeds max(y)
for i in range(len(y)):
    index = y.index(min(y))
    if i < len(y)/3:
        y[index] = 100
    elif i < 2*len(y)/3:
        y[index] = 200
    else:
        y[index] = 300

# map the placeholder values 100/200/300 to class labels 0/1/2
def split_in_3(y):
    if y == 100:
        return 0
    elif y == 200:
        return 1
    else:
        return 2

y2 = map(split_in_3, y)
print(list(y2))
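For what it's worth, if all you need are three equal-count bins, a shorter route is pandas.qcut, which is what the loop above effectively computes. A minimal alternative sketch:
import pandas as pd
from sklearn.datasets import load_boston

data = load_boston()
y = data.target

# labels=False makes qcut return the bin index (0, 1, 2) for each value
y_class = pd.qcut(y, q=3, labels=False)
print(y_class[:10])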
I have this data:
and I am trying to do a simple linear regression model on it.
Here is my code:
from sklearn.linear_model import LinearRegression
X = df[['Date']]
y = df['ACP Cleaning']
model = LinearRegression()
model.fit(X, y)
X_predict = [['2021-1-1']]
y_predict = model.predict(X_predict)
and this is my error:
ValueError: Unable to convert array of bytes/strings into decimal numbers with dtype='numeric'
Linear Regression works with numbers, not strings.
You must pre-process your data in order to match the input of the model.
One way to do it is to parse the strings and convert them to timestamps:
import datetime

def process_date(date_str):
    d = datetime.datetime.strptime(date_str, '%Y-%m-%d')
    return d.timestamp()

# apply to the Series (not the DataFrame) and reshape to 2-D for sklearn
X = df['Date'].apply(process_date).to_numpy().reshape(-1, 1)
The same must be done to the data you want to predict.
Update: If your dataset's datatype is correct, then the problem is with the data you are trying to use for prediction (you cannot pass a raw string to predict).
The following is a complete working example. Pay close attention to the processing done to the X_predict variable.
import datetime
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
rng = pd.date_range('2015-02-24', periods=5, freq='3A')
df = pd.DataFrame({ 'Date': rng, 'Val' : np.random.randn(len(rng))})
print(df.head())
X = (df['Date'].astype('int64') // 10**9).to_numpy().reshape(-1, 1)  # seconds since the epoch, the same unit process_date() returns below
y = df['Val']
model = LinearRegression()
model.fit(X, y)
def process_date(date_str):
    d = datetime.datetime.strptime(date_str, '%Y-%m-%d')
    # return array
    return [d.timestamp()]
X_predict = ['2021-1-1']
X_predict = list(map(process_date, X_predict))
y_predict = model.predict(X_predict)
y_predict
Returns:
Date Val
0 2015-12-31 -0.110503
1 2018-12-31 -0.621394
2 2021-12-31 -1.030068
3 2024-12-31 1.221146
4 2027-12-31 -0.327685
array([-2.6149628])
Update: I used your data to create a csv file:
Date,Val
1-1-2020, 90404.71
2-1-2020, 69904.71
...
Then I loaded it with pandas. Everything looks good to me:
def process_date(date_str):
    # the date format is month-day-year
    d = datetime.datetime.strptime(date_str, '%m-%d-%Y')
    return d.timestamp()

df = pd.read_csv("test.csv")
df['Date'] = df['Date'].apply(process_date)
df.head()
Output:
Date Val
0 1.577848e+09 90404.710
1 1.580526e+09 69904.710
2 1.583032e+09 98934.112
3 1.585710e+09 77084.430
4 1.588302e+09 35877.420
Extracting features:
# must reshape because we have only one feature
X = df['Date'].to_numpy().reshape(-1,1)
y = df['Val'].to_numpy()
model = LinearRegression()
model.fit(X, y)
Predicting:
X_predict = ['1-1-2021', '2-1-2021']
X_predict = np.array(list(map(process_date, X_predict)))
X_predict = X_predict.reshape(-1, 1)
y_predict = model.predict(X_predict)
y_predict
Output:
array([55492.2660361 , 53516.12292932])
This is a good prediction. You can use matplotlib to plot your data and convince yourself:
import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(df['Date'], df['Val'])
plt.show()
Linear Regression needs your arrays to be of numeric type. Since the dates in your X array are stored as strings, Linear Regression won't work as you expect.
You can convert the X array to numeric type by counting the number of days since the beginning date. You can try something like this in your DataFrame:
df.Date = (df.Date - df.Date[0]).dt.days
And then you can continue as you were doing.
This assumes the dates in your Date column already have a datetime dtype; otherwise you would need to convert them first, as in the sketch below.
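A minimal sketch of that conversion, assuming the column still holds date strings that pd.to_datetime can parse:
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])                  # parse strings to datetimes if needed
df['Date'] = (df['Date'] - df['Date'].iloc[0]).dt.days   # days since the first date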
I'm working on a project using Python (3.6) and scikit-learn. I have set up the classifiers, but when I try to fit them with scikit-learn's fit method, it returns an error.
Here's what I have tried:
# Get all the columns from the dataframe
columns = data.columns.tolist()
# Filter the columns to remove data we don't want
columns = [c for c in columns if c not in ["Class"]]
# Store the variable we want to predict on
target = "Class"
X = data.drop(target, 1)
Y = data[target]
# Print the shapes of X & Y
print(X.shape)
print(Y.shape)
# define a random state
state = 1
# define the outlier detection methods
classifiers = {
    "Isolation Forest": IsolationForest(max_samples=len(X),
                                        contamination=outlier_fraction,
                                        random_state=state),
    "Local Outlier Factor": LocalOutlierFactor(
        n_neighbors=20,
        contamination=outlier_fraction)
}
# fit the models
n_outliers = len(Fraud)
for i, (clf_name, clf) in enumerate(classifiers.items()):
    # fit the data and tag outliers
    if clf_name == "Local Outlier Factor":
        y_pred = clf.fit_predict(X)
        scores_pred = clf.negative_outlier_factor_
    else:
        clf.fit(X)
        scores_pred = clf.decision_function(X)
        y_pred = clf.predict(X)
    # Reshape the prediction values to 0 for valid and 1 for fraudulent
    y_pred[y_pred == 1] = 0
    y_pred[y_pred == -1] = 1
    n_errors = (y_pred != Y).sum()
    # run classification metrics
    print('{}: {}'.format(clf_name, n_errors))
    print(accuracy_score(Y, y_pred))
    print(classification_report(Y, y_pred))
Then it returns the following error:
ValueError: could not convert string to float: '301.48 Change: $0.00'
and it points to the `clf.fit(X)` line.
What have I configured wrong?
You can convert the dataset to numeric values based on the uniqueness of each column's entries, and you can also drop unnecessary columns from the dataset.
Here's how you can try that:
df_full = pd.read_excel('input/samp.xlsx', sheet_name=0)
df_full = df_full[df_full.filter(regex='^(?!Unnamed)').columns]
df_full.drop(['paymentdetails'], 1, inplace=True)
df_full.drop(['timestamp'], 1, inplace=True)

# Handle non-numeric data: map each unique value in a column to an integer
def handle_non_numeric_data(df_full):
    columns = df_full.columns.values
    for column in columns:
        text_digit_vals = {}

        def convert_to_int(val):
            return text_digit_vals[val]

        if df_full[column].dtype != np.int64 and df_full[column].dtype != np.float64:
            column_contents = df_full[column].values.tolist()
            unique_elements = set(column_contents)
            x = 0
            for unique in unique_elements:
                if unique not in text_digit_vals:
                    text_digit_vals[unique] = x
                    x += 1
            df_full[column] = list(map(convert_to_int, df_full[column]))
    return df_full
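Alternatively, pandas can do this mapping for you. A minimal sketch using pandas.factorize, which assigns an integer code to each distinct value; 'some_column' is just a placeholder for any non-numeric column in your frame:
import pandas as pd

# factorize returns (codes, uniques); keep only the integer codes
df_full['some_column'] = pd.factorize(df_full['some_column'])[0]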
I'm trying to use a scikit-learn SVM to predict whether a stock from the S&P 500 beats the index or not.
I have the 'sample' file from which I extract the features X and the labels (beats the index or doesn't beat it) Y.
When I tried it the first time (without reshaping X) I got the following deprecation warning:
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17
and will raise ValueError in 0.19. Reshape your data either using
X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1)
if it contains a single sample.
Consequently I reshaped X according to the recommendation and to some forum posts. Now, however, I get the following ValueError saying that X and y don't have compatible shapes.
ValueError: X and y have incompatible shapes.
X has 4437 samples, but y has 493.
Below you can see the shapes of X and Y before reshaping:
('Shape of X = ', (493, 9))
('Shape of Y = ', (493,))
and after reshaping:
('Shape of X = ', (4437, 1))
('Shape of Y = ', (493,))
I also tried to reshape so that I get the (493,9) shape, but also this didn't work as I got the following error.
ValueError: total size of new array must be unchanged.
Below I posted the code to extract the features and labels from the pandas DataFrame, and the SVM analysis:
Feature & Label selection:
X = np.array(sample[features].values)
X = preprocessing.scale(X)
X = np.array(X)
X = X.reshape(-1,1)
Y = sample['status'].values.tolist()
Y = np.array(Y)
Z = np.array(sample[['changemktvalue', 'benchmark']])
SVM testing:
test_size = 50
invest_amount = 1000
total_invests = 0
if_market = 0
if_strat = 0

clf = svm.SVC(kernel="linear", C=1.0)
clf.fit(X[:-test_size], Y[:-test_size])

correct_count = 0
for x in range(1, test_size+1):
    if clf.predict(X[-x])[0] == Y[-x]:
        correct_count += 1
    if clf.predict(X[-x])[0] == 1:
        invest_return = invest_amount + (invest_amount * (Z[-x][0]/100))  # zeroth element of Z
        market_return = invest_amount + (invest_amount * (Z[-x][1]/100))  # market S&P 500 is at pos 1
        total_invests += 1
        if_market += market_return
        if_strat += invest_return

print("Accuracy:", (float(correct_count)/test_size) * 100.00)
It would be great if you had any input on how to solve this.
You should not be reshaping X to (-1, 1). In fact, the error is in your call to the predict method.
Change
clf.predict(X[-x])[0]
to
clf.predict(X[-x].reshape((-1, 9)))[0]
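The reason: predict expects a 2-D array of shape (n_samples, n_features). A minimal sketch of the fixed loop body, assuming X kept its original (493, 9) shape:
row = X[-x].reshape(1, -1)  # one sample with 9 features, shape (1, 9)
if clf.predict(row)[0] == Y[-x]:
    correct_count += 1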
I'd like to find the standard deviation and confidence intervals for an out-of-sample prediction from an OLS model.
This question is similar to Confidence intervals for model prediction, but with an explicit focus on using out-of-sample data.
The idea would be a function along the lines of wls_prediction_std(lm, data_to_use_for_prediction=out_of_sample_df) that returns the prstd, iv_l, iv_u for that out-of-sample dataframe.
For instance:
import pandas as pd
import random
import statsmodels.formula.api as smf
from statsmodels.sandbox.regression.predstd import wls_prediction_std
df = pd.DataFrame({"y": [x for x in range(10)],
                   "x1": [(x*5 + random.random() * 2) for x in range(10)],
                   "x2": [(x*2.1 + random.random()) for x in range(10)]})
out_of_sample_df = pd.DataFrame({"x1": [(x*3 + random.random() * 2) for x in range(10)],
                                 "x2": [(x + random.random()) for x in range(10)]})
formula_string = "y ~ x1 + x2"
lm = smf.ols(formula=formula_string, data=df).fit()
# Prediction works fine:
print(lm.predict(out_of_sample_df))
# I can also get std and CI for in-sample data:
prstd, iv_l, iv_u = wls_prediction_std(lm)
print(prstd)
# I cannot figure out how to get std and CI for out-of-sample data:
try:
print(wls_prediction_std(lm, exog= out_of_sample_df))
except ValueError as e:
print(str(e))
#returns "ValueError: wrong shape of exog"
# trying to concatenate the DFs:
df_both = pd.concat([df, out_of_sample_df], ignore_index=True)
# Only returns results for the data from df, not from out_of_sample_df
lm2 = smf.ols(formula=formula_string, data=df_both).fit()
prstd2, iv_l2, iv_u2 = wls_prediction_std(lm2)
print(prstd2)
It looks like the problem is in the format of the exog parameter. This method is 100% stolen from a workaround by GitHub user thatneat; it is necessary because of a bug in statsmodels.
import numpy as np

def transform_exog_to_model(fit, exog):
    transform = True
    self = fit
    # The following is lifted straight from statsmodels.base.model.Results.predict()
    if transform and hasattr(self.model, 'formula') and exog is not None:
        from patsy import dmatrix
        exog = dmatrix(self.model.data.orig_exog.design_info.builder, exog)
    if exog is not None:
        exog = np.asarray(exog)
        if exog.ndim == 1 and (self.model.exog.ndim == 1 or
                               self.model.exog.shape[1] == 1):
            exog = exog[:, None]
        exog = np.atleast_2d(exog)  # needed in count model shape[1]
    # end lifted code
    return exog
transformed_exog = transform_exog_to_model(lm, out_of_sample_df)
print(transformed_exog)
prstd2, iv_l2, iv_u2 = wls_prediction_std(lm, transformed_exog, weights=[1])
print(prstd2)
Additionally, you can use the get_prediction method.
predictions = lm.get_prediction(out_of_sample_df)
predictions.summary_frame(alpha=0.05)
This returns the confidence and prediction intervals; summary_frame() and get_prediction() are both documented in statsmodels. You can change the significance level of the confidence interval and prediction interval by modifying the "alpha" parameter.
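For instance, a minimal sketch of pulling the intervals out of the frame, assuming lm is the fitted OLS result from above (the column names are the ones summary_frame produces; alpha=0.05 gives 95% intervals):
predictions = lm.get_prediction(out_of_sample_df)
frame = predictions.summary_frame(alpha=0.05)
# mean_ci_lower/upper = confidence interval, obs_ci_lower/upper = prediction interval
print(frame[['mean', 'mean_ci_lower', 'mean_ci_upper', 'obs_ci_lower', 'obs_ci_upper']])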