Simple Linear Regression issue in Python - python

I have this data:
and I am trying to do a simple linear regression model on it.
Here is my code:
from sklearn.linear_model import LinearRegression
X = df[['Date']]
y = df['ACP Cleaning']
model = LinearRegression()
model.fit(X, y)
X_predict = [['2021-1-1']]
y_predict = model.predict(X_predict)
and this is my error:
ValueError: Unable to convert array of bytes/strings into decimal
numbers with dtype='numeric'

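Linear regression works with numbers, not strings.
You must pre-process your data so that it matches the input the model expects.
One way to do this is to parse each date string and convert it to a timestamp:
import datetime
def process_date(date_str):
    # parse a 'YYYY-MM-DD' string and return seconds since the epoch
    d = datetime.datetime.strptime(date_str, '%Y-%m-%d')
    return d.timestamp()
# apply element by element to the column, then reshape to 2-D for scikit-learn
X = df['Date'].apply(process_date).to_numpy().reshape(-1, 1)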
The same must be done to the data you want to predict.
Update: If your dataset's datatype is correct, then the problem is with the data you are trying to use for prediction (you cannot predict a string).
The following is a complete working example. Pay close attention to the processing done to the X_predict variable.
import datetime
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
rng = pd.date_range('2015-02-24', periods=5, freq='3A')
df = pd.DataFrame({ 'Date': rng, 'Val' : np.random.randn(len(rng))})
print(df.head())
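# convert to seconds since the epoch so training and prediction inputs use the same unit
X = df['Date'].apply(lambda d: d.timestamp()).to_numpy().reshape(-1, 1)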
y = df['Val']
model = LinearRegression()
model.fit(X, y)
def process_date(date_str):
    d = datetime.datetime.strptime(date_str, '%Y-%m-%d')
    # return array
    return [d.timestamp()]
X_predict = ['2021-1-1']
X_predict = list(map(process_date, X_predict))
y_predict = model.predict(X_predict)
y_predict
Returns:
        Date       Val
0 2015-12-31 -0.110503
1 2018-12-31 -0.621394
2 2021-12-31 -1.030068
3 2024-12-31  1.221146
4 2027-12-31 -0.327685
array([-2.6149628])
Update: I used your data to create a csv file:
Date,Val
1-1-2020, 90404.71
2-1-2020, 69904.71
...
And then I loaded with pandas. Everything looks good to me:
def process_date(date_str):
    # the date format is month-day-year
    d = datetime.datetime.strptime(date_str, '%m-%d-%Y')
    return d.timestamp()
df = pd.read_csv("test.csv")
df['Date'] = df['Date'].apply(process_date)
df.head()
Output:
           Date        Val
0  1.577848e+09  90404.710
1  1.580526e+09  69904.710
2  1.583032e+09  98934.112
3  1.585710e+09  77084.430
4  1.588302e+09  35877.420
Extracting features:
# scikit-learn expects a 2-D array, so reshape the single feature into one column
X = df['Date'].to_numpy().reshape(-1,1)
y = df['Val'].to_numpy()
model = LinearRegression()
model.fit(X, y)
Predicting:
X_predict = ['1-1-2021', '2-1-2021']
X_predict = np.array(list(map(process_date, X_predict)))
X_predict = X_predict.reshape(-1, 1)
y_predict = model.predict(X_predict)
y_predict
Output:
array([55492.2660361 , 53516.12292932])
The prediction looks reasonable. You can plot your data with matplotlib to check it against the trend:
import matplotlib.pyplot as plt
# IPython/Jupyter magic; omit this line in a plain Python script
%matplotlib inline
plt.plot(df['Date'], df['Val'])
plt.show()

Linear regression needs numeric arrays. Because the dates in your X array are stored as strings, it won't work as you expect.
You can convert the X array to a numeric type by counting the number of days since the first date. Try something like this on your DataFrame:
df['Date'] = (df['Date'] - df['Date'].iloc[0]).dt.days
You can then continue as you were doing.
I have assumed that the values in your Date column are already datetimes; if they are not, you need to convert them first (for example with pd.to_datetime).
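A minimal sketch of the day-count approach from start to finish, assuming made-up data in place of the original DataFrame. The column name `Days` and the variables `start` and `new_dates` are introduced here for illustration only. The key point is that the dates to predict must be measured from the same start date as the training data.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: values that fall by 1 per day
df = pd.DataFrame({
    "Date": ["2020-01-01", "2020-01-11", "2020-01-21", "2020-01-31"],
    "Val": [100.0, 90.0, 80.0, 70.0],
})

# Strings -> datetimes -> whole days since the first date
df["Date"] = pd.to_datetime(df["Date"])
start = df["Date"].iloc[0]
df["Days"] = (df["Date"] - start).dt.days

X = df[["Days"]].to_numpy()   # 2-D: one row per sample, one column
y = df["Val"].to_numpy()
model = LinearRegression().fit(X, y)

# Apply the same transformation to the dates to predict
new_dates = pd.to_datetime(pd.Series(["2020-02-10"]))
X_new = (new_dates - start).dt.days.to_numpy().reshape(-1, 1)
y_new = model.predict(X_new)
print(y_new)
```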

Related

Python statsmodels – ValueError: how to create variable in range 0 to 1?

Code:
import numpy as np
import pandas as pd
import statsmodels.api as sm
sacramento = pd.read_csv("sacramento.csv")
X = sacramento[["beds", "sqft", "price"]]
Y = sacramento["baths"]
X = sm.add_constant(X)
model = sm.Logit(Y, X).fit()
predictions = model.predict(X)
print_model = model.summary()
print(print_model)
print(model.params.round(2))
print(model.pvalues.round(2))
print('The smallest p-value is for sqft')
The problem I have is with the "You will need to create a new variable from baths, and it should make it such that those observations of 1 bath correspond to a value of 0, and those with more than 1 bath correspond to a 1." instruction.
I really do not know how to do that. I know that it causes a ValueError: endog must be in the unit interval.
Link to the csv file: https://drive.google.com/file/d/1A3LQ2vZ9IUkv_2HkqP8c2sCQGAvdII-r/view?usp=sharing
Can you try this?
sacramento["baths"] = sacramento["baths"].apply(lambda x: 0 if x == 1 else 1)
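An equivalent vectorized sketch, using a small made-up frame in place of sacramento.csv. The column name `multi_bath` is hypothetical. Note one difference: the comparison below maps only values greater than 1 to 1, which follows the wording of the assignment, whereas the lambda above maps every value other than exactly 1 to 1.

```python
import pandas as pd

# Hypothetical stand-in for the sacramento data
sacramento = pd.DataFrame({"baths": [1, 2, 3, 1, 2]})

# True where there is more than one bath, then cast True/False to 1/0
sacramento["multi_bath"] = (sacramento["baths"] > 1).astype(int)
print(sacramento)
```

The new column lies in the unit interval, so it can be passed to sm.Logit as the dependent variable.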

ValueError: Unable to coerce to Series, length must be 1: given n

I have been trying to use RF regression from scikit-learn, but I’m getting an error with my standard (from docs and tutorials) model. Here is the code:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
df = pd.read_excel('/home/artyom/myprojects//valuevo/field2019/report/segs_inventar_dataframe/excel_var/invcents.xlsx')
age = df[['AGE_1', 'AGE_2', 'AGE_3', 'AGE_4', 'AGE_5']]
hight = df[['HIGHT_','HIGHT_1', 'HIGHT_2', 'HIGHT_3', 'HIGHT_4', 'HIGHT_5']]
diam = df[['DIAM_', 'DIAM_1', 'DIAM_2', 'DIAM_3', 'DIAM_4', 'DIAM_5']]
za = df[['ZAPSYR_', 'ZAPSYR_1', 'ZAPSYR_2', 'ZAPSYR_3', 'ZAPSYR_4', 'ZAPSYR_5']]
tova = df[['TOVARN_', 'TOVARN_1', 'TOVARN_2', 'TOVARN_3', 'TOVARN_4', 'TOVARN_5']]
#df['average'] = df.mean(numeric_only=True, axis=1)
df['meanage'] = age.mean(numeric_only=True, axis=1)
df['meanhight'] = hight.mean(numeric_only=True, axis=1)
df['mediandiam'] = diam.mean(numeric_only=True, axis=1)
df['medianza'] = za.mean(numeric_only=True, axis=1)
df['mediantova'] = tova.mean(numeric_only=True, axis=1)
unite = df[['gapA_segA','gapP_segP', 'A_median', 'p_median', 'circ_media','fdi_median', 'pfd_median', 'p_a_median', 'gsci_media','meanhight']].dropna()
from sklearn.model_selection import train_test_split as ttsplit
df_copy = unite.copy()
trainXset = df_copy[['gapA_segA','gapP_segP', 'A_median', 'p_median', 'circ_media','fdi_median', 'pfd_median', 'p_a_median', 'gsci_media']]
trainYset = df_copy [['meanhight']]
trainXset_train, trainXset_test, trainYset_train, trainYset_test = ttsplit(trainXset, trainYset, test_size=0.3) # 70% training and 30% test
rf = RandomForestRegressor(n_estimators = 100, random_state = 40)
rf.fit(trainXset_train, trainYset_train)
predictions = rf.predict(trainXset_test)
errors = abs(predictions - trainYset_test)
mape = 100 * (errors / trainYset_test)
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')
But output doesn’t look ok:
---> 24 errors = abs(predictions - trainYset_test)
25 # Calculate mean absolute percentage error (MAPE)
26 mape = 100 * (errors / trainYset_test)
..... somemore track
ValueError: Unable to coerce to Series, length must be 1: given 780
How can I fix it? 780 is the shape of trainYset_test. I’m not asking for a solution (i.e. write code for me), but for advice on why this error happened. I followed everything as in tutorials.
The error shows that the shapes do not match: predictions is a 1-D array of length 780, while trainYset_test is a DataFrame with 780 rows and a single column.
Use reshape to give predictions the same shape:
predictions = predictions.reshape(-1, 1)
I solved this by making sure the predictions were the same data type as the actual data. In my case, the failing line was:
MSE = (sum((y_test-predictions)**2))/(len(newX)-len(newX.columns))
I resolved it by casting y_test to a NumPy array:
MSE = (sum((np.array(y_test)-predictions)**2))/(len(newX)-len(newX.columns))
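The following sketch illustrates both answers on made-up numbers. The values are hypothetical; only the shapes mirror the question, where the target was selected with double brackets and is therefore a one-column DataFrame, while the predictions are a flat array. Converting the target to a flat NumPy array makes the arithmetic element-wise.

```python
import numpy as np
import pandas as pd

# Hypothetical values standing in for trainYset_test and the model output
y_test = pd.DataFrame({"meanhight": [10.0, 20.0, 40.0]})  # shape (3, 1)
predictions = np.array([11.0, 18.0, 40.0])                 # shape (3,)
print(y_test.shape, predictions.shape)

# Flatten the target so both operands are 1-D arrays of the same length
y_true = y_test.to_numpy().ravel()
errors = np.abs(predictions - y_true)
mape = 100 * np.mean(errors / y_true)
accuracy = 100 - mape
print('Accuracy:', round(accuracy, 2), '%.')
```

Selecting the target with single brackets, such as df_copy['meanhight'], would give a Series instead of a DataFrame and avoid the mismatch at the source.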

how to fix ''Found input variables with inconsistent numbers of samples: [219, 247]''

As the title says, when running the following code I get the error Found input variables with inconsistent numbers of samples: [219, 247]. I have read that the problem should be in how the np.array for X and y is set up, but I cannot pin it down: there is a price for every date, so I don't understand why it is happening. Any help will be appreciated, thanks!
import pandas as pd
import quandl, math, datetime
import numpy as np
from sklearn import preprocessing, svm, model_selection
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
df = quandl.get("NASDAQOMX/XNDXT25NNR", authtoken='myapikey')
df = df[['Index Value','High','Low','Total Market Value']]
df['HL_PCT'] = (df['High'] - df['Low']) / df['Index Value'] * 100.0
df = df[['Low','High','HL_PCT']]
forecast_col = 'High'
df.fillna(-99999, inplace=True)
forecast_out = int(math.ceil(0.1*len(df)))
df['label'] = df[forecast_col].shift(-forecast_out)
df.dropna(inplace= True)
X = np.array(df.drop(columns=['label']))
X = preprocessing.scale(X)
X_lately = X[-forecast_out:]
X = X[:-forecast_out]
y=np.array(df['label'])
#X= X[:-forecast_out+1]
df.dropna(inplace=True)
y= np.array(df['label'])
X_train, X_test, y_train, y_test= model_selection.train_test_split(X,
y,test_size=0.2)
clf= LinearRegression(n_jobs=-1)
clf.fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
forecast_set= clf.predict(X_lately)
print(forecast_set, accuracy, forecast_out)
df['Forecast'] = np.nan
last_date= df.iloc[-1].name
last_unix= last_date.timestamp()
one_day=86400
next_unix= last_unix + one_day
for i in forecast_set:
    next_date= datetime.datetime.fromtimestamp(next_unix)
    next_unix += one_day
    # one new row per forecast: NaN for every column except the last ('Forecast')
    df.loc[next_date]= [np.nan for _ in range(len(df.columns) -1)] + [i]
df['High'].plot()
df['Forecast'].plot()
plt.legend(loc=4)
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()
The expected result is a plot of the predicted future price for that ticker, but instead the code throws the error 'Found input variables with inconsistent numbers of samples: [219, 247]'.
Your problem lies in these two lines extracted from your code:
X = X[:-forecast_out]
y= np.array(df['label'])
You're subsetting X, but leaving y "as it is".
You can check that the shapes do differ with:
X.shape, y.shape
Change the last line to:
y= np.array(df[:-forecast_out]['label'])
and you're fine.
Note also that, instead of these repetitive lines:
y=np.array(df['label'])
#X= X[:-forecast_out+1]
df.dropna(inplace=True) # there is no na at this point
y= np.array(df['label'])
the following line (the solution to your problem) is enough:
y= np.array(df[:-forecast_out]['label'])
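A small sketch that reproduces the alignment on made-up data, assuming a ten-row frame and a forecast_out of 2 chosen only for illustration. It follows the same order of operations as the question and shows that X and y have the same number of rows once y is trimmed as well.

```python
import numpy as np
import pandas as pd

# Hypothetical series of 10 prices
df = pd.DataFrame({"High": np.arange(10, dtype=float)})
forecast_out = 2

# Label = value forecast_out rows ahead; the last rows have no label and are dropped
df["label"] = df["High"].shift(-forecast_out)
df.dropna(inplace=True)                      # 8 rows remain

X = df[["High"]].to_numpy()
X_lately = X[-forecast_out:]                 # rows held back for forecasting
X = X[:-forecast_out]                        # 6 rows used for training
y = np.array(df[:-forecast_out]["label"])    # trimmed the same way: 6 labels

print(X.shape, y.shape)
```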

Prophet Python ValueError: Regressor missing from dataframe

I am trying to use the latest (2nd) 0.3 version of the Prophet package for Python.
My model should include an exogenous regressor, but I receive a ValueError stating that the regressor is missing from the dataframe, even though it does exist. Is this a bug, or what am I doing wrong?
#Random Dataset Preparation
import random
import numpy
import pandas
from fbprophet import Prophet  # package name used by Prophet 0.3
random.seed(a=1)
df = pandas.DataFrame(data = None, columns = ['ds', 'y', 'ex'], index = range(50))
datelist = pandas.date_range(pandas.datetime.today(), periods = 50).tolist()
y = numpy.random.normal(0, 1, 50)
ex = numpy.random.normal(0, 2, 50)
df['ds'] = datelist
df['y'] = y
df['ex'] = ex
#Model
prophet_model = Prophet(seasonality_prior_scale = 0.1)
Prophet.add_regressor(prophet_model, 'ex')
prophet_model.fit(df)
prophet_forecast_step = prophet_model.make_future_dataframe(periods=1)
#Result-df
prophet_x_df = pandas.DataFrame(data=None, columns=['Date_x', 'Res'], index = range(int(len(y))))
#Error
prophet_x_df.iloc[0,1] = prophet_model.predict(prophet_forecast_step).iloc[0,0]
You first need to create a column with the regressor values; it must be present in both the fitting and the prediction dataframes.
See the Prophet docs.
make_future_dataframe generates a dataframe with a ds column only.
You need to add an 'ex' column to the prophet_forecast_step dataframe in order to use it as a regressor.
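A pandas-only sketch of that step, so it runs without Prophet installed. The `future` frame stands in for the result of make_future_dataframe(periods=1), which contains only ds as described above. The future regressor value `next_ex` is an assumption made for illustration (the last observed value carried forward); in practice it has to come from your own forecast or known future values of the regressor.

```python
import numpy as np
import pandas as pd

# History with the regressor, as in the question (random placeholder values)
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "ds": pd.date_range("2018-01-01", periods=50),
    "y": rng.normal(0, 1, 50),
    "ex": rng.normal(0, 2, 50),
})

# Stand-in for prophet_model.make_future_dataframe(periods=1): history dates plus one, 'ds' only
future = pd.DataFrame({"ds": pd.date_range(df["ds"].iloc[0], periods=len(df) + 1)})

# Hypothetical choice: carry the last observed regressor value forward one step
next_ex = df["ex"].iloc[-1]
future["ex"] = np.append(df["ex"].to_numpy(), next_ex)

# With Prophet installed, this frame would be the one passed to prophet_model.predict
print(future.tail(3))
```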

Sorting csv data by headers but getting IndexError

It seems that my code fails when I try to set what headers/columns of data I want to use giving me an index error when trying to parse headers
import pandas as pd
import quandl
import math, datetime
import numpy as np
from sklearn import preprocessing , cross_validation, svm
from sklearn.linear_model import LinearRegression
import scipy
import matplotlib.pyplot as plt
from matplotlib import style
import pickle
style.use('ggplot')
df = pd.read_csv('convertcsv.csv',sep='\t')
df = np.array(df)
print(df)
df = df[['Open','High','Low','Close','Volume (BTC)']]
print("ok")
df['HL_PCT'] = (df['High'] - df['Close']) / df['Close'] * 100.0
df['PCT_change'] = (df['Close'] - df['Open']) / df['Open'] * 100.0
df = df[['Close','HL_PCT','PCT_change','Volume (BTC)']]
forecast_col = 'Close'
df.fillna(-999999, inplace=True)
forecast_out = int(math.ceil(0.01*len(df)))
df['label'] = df[forecast_col].shift(-forecast_out)
X = np.array(df.drop(['label'],1))
X = preprocessing.scale(X)
X_lately = X[-forecast_out:]
X = X[:-forecast_out:]
df.dropna(inplace=True)
y = np.array(df['label'])
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y,
test_size=0.2)
clf = LinearRegression(n_jobs=-1)
clf.fit(X_train, y_train)
with open('linearregression.pickle','wb') as f:
    pickle.dump(clf, f)
with open('linearregression.pickle','rb') as pickle_in:
    clf = pickle.load(pickle_in)
accuracy = clf.score(X_test,y_test)
print(accuracy)
forecast_set = clf.predict(X_lately)
df['Forecast'] = np.nan
last_date = df.iloc[-1].name
last_unix = last_date.timestamp()
one_day = 86400
next_unix = last_unix + one_day
for i in forecast_set:
    next_date = datetime.datetime.fromtimestamp(next_unix)
    next_unix += one_day
    df.loc[next_date] = [np.nan for _ in range(len(df.columns)-1)] + [i]
df['Close'].plot()
df['Forecast'].plot()
plt.legend(loc=4)
plt.xlabel('Date')
plt.ylabel('Price')
plt.pause(1)
plt.show()
print("we done?")
...
I can't seem to figure out what I am doing wrong. It worked with the previous data set I was using. If it helps, here is the format of the CSV file I was reading from:
Timestamp,Open,High,Low,Close,Volume (BTC),Volume (Currency),Weighted Price
2017-09-30 00:00:00,4162.04,4177.63,4154.28,4176.08,114.81,478389.12,4166.96
2017-09-30 01:00:00,4170.84,4224.6,4170.84,4208.14,348.45,1463989.18,4201.4
I am not too experienced with this sort of thing. I tried to find other people with the same error, but everyone was having a different sort of problem. I can include more data if it is needed.
You're converting your dataframe to a numpy array with df = np.array(df).
Don't expect a numpy array to function as a pandas dataframe.
Remove
df = np.array(df)
and you should be able to slice your matrix by column name with
df = df[['Open','High','Low','Close','Volume (BTC)']]
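A short sketch of that order of operations, using the two sample rows from the question read from a string rather than from convertcsv.csv. The sample shown is comma-separated, so the default separator is used here; that is an assumption about the real file. Column selection by name happens while the data is still a DataFrame, and the conversion to a NumPy array comes last.

```python
import io
import pandas as pd

# Two rows copied from the sample in the question
csv_text = """Timestamp,Open,High,Low,Close,Volume (BTC),Volume (Currency),Weighted Price
2017-09-30 00:00:00,4162.04,4177.63,4154.28,4176.08,114.81,478389.12,4166.96
2017-09-30 01:00:00,4170.84,4224.6,4170.84,4208.14,348.45,1463989.18,4201.4
"""
df = pd.read_csv(io.StringIO(csv_text), index_col="Timestamp", parse_dates=True)

# Select and derive columns by name while df is still a DataFrame
df = df[['Open', 'High', 'Low', 'Close', 'Volume (BTC)']]
df['HL_PCT'] = (df['High'] - df['Close']) / df['Close'] * 100.0

# Convert to a NumPy array only once the named columns are no longer needed
X = df.to_numpy()
print(X.shape)
```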
