What kind of model can I use to forecast this data? - python

This is the dataset I have of orders each week. I want to predict the orders for the rest of the year. I've tried building an ARIMA model and it doesn't work.
Is there any other model I can try for such a small dataset? Maybe an HMM, fitting a polynomial curve to it, or building a time series LSTM?
FW Order
1 6
2 45
3 59
4 60
5 50
6 115
7 23
8 44
9 164
10 8
11 30
12 20
13 0
14 50
15 60
16 0
17 50
18 30
19 115
20 75
21 54
22 29
23 124
24 32
25 28

Here's a plot of your data. Your main problem is that there isn't really enough data for any model to give you meaningful predictions with statistical significance. Your data mostly just looks like white noise around a mean, so you'd represent it with:
x_t = mu + e_t
where e_t is an error term representing white noise.
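Concretely, that model's best forecast for every future week is just the sample mean; a minimal sketch:
import pandas as pd

orders = [6, 45, 59, 60, 50, 115, 23, 44, 164, 8, 30, 20, 0,
          50, 60, 0, 50, 30, 115, 75, 54, 29, 124, 32, 28]
s = pd.Series(orders)

# Under the white-noise model the point forecast is the mean,
# and the sample standard deviation gives a rough error band.
print(f"forecast = {s.mean():.1f} +/- {s.std():.1f} orders per week")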
There is a hint of mean reversion, so you could try an Ornstein-Uhlenbeck model:
dx_t = theta * (mu - x_t) dt + sigma * dW_t
https://en.wikipedia.org/wiki/Ornstein%E2%80%93Uhlenbeck_process
Here it is coded up. The orange line is the prediction. Again, the prediction isn't great, but you probably won't find much better without more data.
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

def least_squares_naive(s, delta=1.0):
    # Regress the one-step differences on the lagged level:
    # dx = a * x_{t-1} + b + e, the discretised OU process.
    y = s.diff().iloc[1:]
    x = s.shift(1)[1:]
    res = sm.OLS(y, sm.add_constant(x)).fit()
    b, a = res.params
    residuals = y - (a * x + b)
    se = residuals.std(ddof=2)
    # Map the regression coefficients back to the OU parameters.
    lambda_ = -a / delta
    mu_ = b / (lambda_ * delta)
    sigma_ = se / (delta ** 0.5)
    return mu_, lambda_, sigma_

orders = [6, 45, 59, 60, 50, 115, 23, 44, 164, 8, 30, 20, 0,
          50, 60, 0, 50, 30, 115, 75, 54, 29, 124, 32, 28]
s = pd.Series(orders)

mu_, lambda_, sigma_ = least_squares_naive(s)

# One-step-ahead prediction: pull the last value towards the mean.
dx = -lambda_ * (s - mu_)
pred = (s + dx).shift()

s.plot()
pred.plot()
plt.show()
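To forecast beyond the observed weeks, you could simulate the fitted process forward; here's a minimal sketch (Euler-Maruyama with a weekly step; the 27-week horizon and path count are my assumptions, and it reuses s, mu_, lambda_ and sigma_ from the block above):
import numpy as np

# Simulate the fitted OU process forward from the last observation.
rng = np.random.default_rng(0)
horizon = 27     # assumed: remaining weeks of the year
n_paths = 1000
paths = np.empty((n_paths, horizon))
x = np.full(n_paths, s.iloc[-1])
for t in range(horizon):
    # Euler-Maruyama step with delta = 1 week
    x = x + lambda_ * (mu_ - x) + sigma_ * rng.standard_normal(n_paths)
    paths[:, t] = x

# Average over paths for a point forecast per future week.
print(paths.mean(axis=0))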

Related

How to create a sine curve of positive part only between two integer values

I have to generate a sine curve of the positive part only between two values. The idea: my variable, say monthly-averaged RH, has 12 data points in a year (i.e. a time series) and varies between 50 and 70 in a sinusoidal way. The first and last data points end at 50.
Can anyone help with how I can generate this curve/function to get the values of all intermediate data points? I am trying to use numpy/scipy for this.
Best,
Debayan
This is basic trig.
import math

for i in range(12):
    print(i, 50 + 20 * math.sin(math.pi * i / 12))
Output:
0 50.0
1 55.17638090205041
2 60.0
3 64.14213562373095
4 67.32050807568876
5 69.31851652578136
6 70.0
7 69.31851652578136
8 67.32050807568878
9 64.14213562373095
10 60.0
11 55.17638090205042
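Since the question mentions numpy/scipy, here is the same half-sine vectorised (a numpy-only sketch of the loop above):
import numpy as np

# Same positive half-sine as the loop above, vectorised over all 12 months.
i = np.arange(12)
rh = 50 + 20 * np.sin(np.pi * i / 12)
print(rh)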

Given a SciPy discrete random variable distribution, how do I round a number to the closest value in that distribution? [duplicate]

What I ultimately want to do is round the expected value of a discrete random variable distribution to a valid number in the distribution. For example if I am drawing evenly from the numbers [1, 5, 6], the expected value is 4 but I want to return the closest number to that (ie, 5).
import numpy as np
from scipy.stats import rv_discrete

xk = (1, 5, 6)
pk = np.ones(len(xk)) / len(xk)
custom = rv_discrete(name='custom', values=(xk, pk))
print(custom.expect())
# 4.0

def round_discrete(discrete_rv_dist, val):
    # do something here
    return answer

print(round_discrete(custom, custom.expect()))
# 5.0
I don't know a priori what distribution will be used (i.e. it might not be integers, might be an unbounded distribution), so I'm really struggling to think of an algorithm that is sufficiently generic. Edit: I just learned that rv_discrete doesn't work on non-integer xk values.
As to why I want to do this: I'm putting together a Monte Carlo simulation and want a "nominal" value for each distribution. I think the EV is the most physically appropriate choice, rather than the mode or median. The downstream simulation may have values that must be one of several discrete choices, so passing a value outside that set is not acceptable.
If there's already a nice way to do this in Python that would be great, otherwise I can interpret math into code.
Here is R code that I think will do what you want, using Poisson data to illustrate:
set.seed(322)
x = rpois(100, 7) # 100 obs from POIS(7)
a = mean(x); a
[1] 7.16 # so 7 is the value we want
d = min(abs(x-a)); d # min distance btw a and actual Pois val
[1] 0.16
u = unique(x); u # unique Pois values observed
[1] 7 5 4 10 2 9 8 6 11 3 13 14 12 15
v = u[abs(u-a)==d]; v # unique val closest to a
[1] 7
Hope you can translate it to Python (a sketch follows after the second run below).
Another run:
set.seed(323)
x = rpois(100, 20)
a = mean(x); a
[1] 20.32
d = min(abs(x-a)); d
[1] 0.32
u = unique(x)
v = u[abs(u-a)==d]; v
[1] 20
x
[1] 17 16 20 23 23 20 19 23 21 19 21 20 22 25 13 15 19 19 14 27 19 30 17 19 23
[26] 16 23 26 33 16 11 23 14 21 24 12 18 20 20 19 26 12 22 24 20 22 17 23 11 19
[51] 19 26 17 17 11 17 23 21 26 13 18 28 22 14 17 25 28 24 16 15 25 26 22 15 23
[76] 27 19 21 17 23 21 24 23 22 23 18 25 14 24 25 19 19 21 22 16 28 18 11 25 23
u
[1] 17 16 20 23 19 21 22 25 13 15 14 27 30 26 33 11 24 12 18 28
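A rough Python translation of that logic (my sketch; numpy's Poisson sampler stands in for rpois, so the draws won't match R's seed):
import numpy as np

# Find the unique observed value closest to the sample mean,
# mirroring the R example above.
rng = np.random.default_rng(322)
x = rng.poisson(7, 100)            # 100 obs from Pois(7)
a = x.mean()                       # the target (expected) value
u = np.unique(x)                   # unique observed values
v = u[np.argmin(np.abs(u - a))]    # unique value closest to a
print(a, v)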
Figured it out, and tested it working. If I plug my value X into the cdf, then I can plug that probability P = cdf(X) into the ppf. The values at ppf(P +- epsilon) will give me the closest values in the set to X.
Or more geometrically, for a discrete pmf, the point (X,P) will lie on a horizontal portion of the corresponding cdf. When you invert the cdf, (P,X) is now on a vertical section of the ppf. Taking P +- eps will give you the 2 nearest flat portions of the ppf connected to that vertical jump, which correspond to the valid values X1, X2. You can then do a simple difference to figure out which is closer to your target value.
import numpy as np

# custom is the rv_discrete instance defined in the question above
eps = np.finfo(float).eps

ev = custom.expect()
p = custom.cdf(ev)
# Nudging p by one epsilon in each direction lands on the flat portions
# of the ppf on either side of the jump at X
ev_candidates = custom.ppf([p - eps, p, p + eps])
ev_candidates_distance = abs(ev_candidates - ev)
ev_closest = ev_candidates[np.argmin(ev_candidates_distance)]
print(ev_closest)
# 5.0
Terms:
pmf - probability mass function
cdf - cumulative distribution function (cumulative sum of the pmf)
ppf - percent point function (inverse of the cdf)
eps - machine epsilon (the smallest representable floating-point increment)
Would the function ceil from the math library help? For example:
from math import ceil
print(float(ceil(3.333333333333333)))

python sklearn multiple linear regression display r-squared

I calculated my multiple linear regression equation and I want to see the adjusted R-squared. I know the score function gives me R-squared, but it is not adjusted.
import pandas as pd
import numpy as np

df = pd.read_csv('/Users/jeangelj/Documents/training/linexdata.csv', sep=',')
df
AverageNumberofTickets NumberofEmployees ValueofContract Industry
0 1 51 25750 Retail
1 9 68 25000 Services
2 20 67 40000 Services
3 1 124 35000 Retail
4 8 124 25000 Manufacturing
5 30 134 50000 Services
6 20 157 48000 Retail
7 8 190 32000 Retail
8 20 205 70000 Retail
9 50 230 75000 Manufacturing
10 35 265 50000 Manufacturing
11 65 296 75000 Services
12 35 336 50000 Manufacturing
13 60 359 75000 Manufacturing
14 85 403 81000 Services
15 40 418 60000 Retail
16 75 437 53000 Services
17 85 451 90000 Services
18 65 465 70000 Retail
19 95 491 100000 Services
from sklearn.linear_model import LinearRegression
model = LinearRegression()
X, y = df[['NumberofEmployees','ValueofContract']], df.AverageNumberofTickets
model.fit(X, y)
model.score(X, y)
>>0.87764337132340009
I checked it manually and 0.87764 is R-squared; whereas 0.863248 is the adjusted R-squared.
There are many different ways to compute R^2 and the adjusted R^2; the following are a few of them (computed with the data you provided):
from sklearn.linear_model import LinearRegression
model = LinearRegression()
X, y = df[['NumberofEmployees','ValueofContract']], df.AverageNumberofTickets
model.fit(X, y)
Recall that SST = SSR + SSE (total sum of squares = regression SS + residual SS).
# compute with formulas from the theory
yhat = model.predict(X)
SS_Residual = sum((y - yhat)**2)
SS_Total = sum((y - np.mean(y))**2)
r_squared = 1 - float(SS_Residual) / SS_Total
adjusted_r_squared = 1 - (1 - r_squared) * (len(y) - 1) / (len(y) - X.shape[1] - 1)
print(r_squared, adjusted_r_squared)
# 0.877643371323 0.863248473832

# compute with sklearn's linear_model; I could not find a function in the
# documentation that computes the adjusted R-squared directly
print(model.score(X, y), 1 - (1 - model.score(X, y)) * (len(y) - 1) / (len(y) - X.shape[1] - 1))
# 0.877643371323 0.863248473832
Another way:
# compute with statsmodels, by adding intercept manually
import statsmodels.api as sm
X1 = sm.add_constant(X)
result = sm.OLS(y, X1).fit()
# print(dir(result))
print(result.rsquared, result.rsquared_adj)
# 0.877643371323 0.863248473832
Yet another way:
# compute with statsmodels, another way, using formula
import statsmodels.formula.api as sm
result = sm.ols(formula="AverageNumberofTickets ~ NumberofEmployees + ValueofContract", data=df).fit()
# print(result.summary())
print(result.rsquared, result.rsquared_adj)
# 0.877643371323 0.863248473832
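If you need this repeatedly, a small helper (my sketch, reusing model, X and y from above; n is the number of observations and p the number of predictors) wraps the same formula:
def adjusted_r2(r2, n, p):
    # Adjusted R-squared penalises plain R-squared for the number of predictors p
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(model.score(X, y), len(y), X.shape[1]))
# 0.863248473832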
regressor = LinearRegression(fit_intercept=False)
regressor.fit(x_train, y_train)  # assumes x_train and y_train are already defined
print(f'r_sqr value: {regressor.score(x_train, y_train)}')

Error in using knn for multidimensional data

I am a beginner in Machine Learning and I am trying to classify multi-dimensional data into two classes. Each data point is 40x6 float values. To begin with, I have read my csv file; in this file, the shot number identifies a data point.
https://docs.google.com/spreadsheets/d/1tW1xJqnNZa1PhVDAE-ieSVbcdqhT8XfYGy8ErUEY_X4/edit?usp=sharing
Here is the code in python:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plot

from sklearn.neighbors import KNeighborsClassifier

# Read csv data into a pandas data frame
data_frame = pd.read_csv('data.csv')

extract_columns = ['LinearAccX', 'LinearAccY', 'LinearAccZ', 'Roll', 'pitch', 'compass']

# Number of samples in one shot
samples_per_shot = 40

# Calculate number of shots in the dataframe
count_of_shots = len(data_frame.index) / samples_per_shot

# Initialize empty training data
training_index = range(count_of_shots)
training_data_list = []

# Flag for backward compatibility
make_old_data_compatible_with_new = 0

if make_old_data_compatible_with_new:
    # Convert 40-sample shot data to 25-sample shot data
    # (new logic takes 25 samples/shot, old logic takes 40 samples/shot)
    start_shot_sample_index = 9
    end_shot_sample_index = 34
else:
    # Start index from 1 and continue till, let's say, 40
    start_shot_sample_index = 1
    end_shot_sample_index = samples_per_shot

# Extract each shot into a pandas series
for shot in range(count_of_shots):
    # Extract the current shot
    current_shot_data = data_frame[data_frame['shot_no'] == (shot + 1)]

    # Select only the following columns
    selected_columns_from_shot = current_shot_data[extract_columns]

    # Select columns from selected rows
    # Find start and end row indexes
    current_shot_data_start_index = shot * samples_per_shot + start_shot_sample_index
    current_shot_data_end_index = shot * samples_per_shot + end_shot_sample_index
    selected_rows_from_shot = selected_columns_from_shot.ix[current_shot_data_start_index:current_shot_data_end_index]

    # Append to list of lists
    # (converts the selected shot into a multi-dimensional array)
    training_data_list.append([selected_columns_from_shot[extract_columns[index]].values.tolist() for index in range(len(extract_columns))])

# Collect each sliced shot into the training data
training_data = pd.DataFrame(training_data_list, columns=extract_columns)
training_features = [1 for i in range(count_of_shots)]
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(training_data, training_features)
After running the above code, I am getting an error
ValueError: setting an array element with a sequence.
for the line
knn.fit(training_data, training_features)

Multivariate Normal changes value of variable in PyMC

I might be doing something wrong but I can't figure out what it is. I'm trying to reproduce some results from a real estate dataset from Baton Rouge, LA. The original code is written in WinBUGS here. There are some minor differences between the dataset used in the link above and the one I'm using right now, but I think that is not significant. This is the code:
import pymc as pm, pandas as pd, numpy as np
from scipy.spatial.distance import pdist, squareform
from numpy.linalg import inv
# Loading dataset
df = pd.read_table('http://pastebin.com/raw.php?i=41us4HVj', sep=' ')
# Setting priors
beta = pm.Normal('beta', 0.0, 0.1, size=3)
mu = pm.Lambda('mu', lambda b=beta:
               b[0] + b[1]*df['LivingArea']/1000.0 + b[2]*df['Age'])
tau = pm.Gamma('tau', 0.1, 0.1)
phi = pm.Uniform('phi', 0.1, 10)
# Build a matrix of pairwise distances between locations
A = squareform(pdist(np.array(list(zip(df['Latitude'], df['Longitude'])))))
# Using the powered exponential to obtain a precision matrix
precision = pm.Lambda('exp', lambda u=A, tau=tau, phi=phi, kappa=1:
                      inv((1/tau)*np.exp(-(phi*u)**kappa)))
If I inspect the value of mu, I get this:
mu.value
Out[2]:
0 24.568272
1 2.909063
2 -2.778916
3 28.206696
4 -0.270571
5 -2.865153
6 14.158162
7 31.466438
8 44.681351
9 22.191397
10 -6.412350
11 11.709424
12 25.453254
13 24.366674
14 34.711048
...
55 24.625763
56 21.763089
57 65.108136
58 15.428714
59 20.992329
60 36.384037
61 16.730507
62 23.021763
63 54.887747
64 30.612696
65 52.685840
66 59.612372
67 18.822422
68 18.940658
69 72.678188
Length: 70, dtype: float64
However, after running MvNormal, the value of mu is changed:
w = pm.MvNormal('w', mu, precision)
mu.value
Out[4]:
0 -107.913779
1 -1.243466
2 8.283926
3 26.412651
4 1.806728
5 -1.300734
6 -80.657396
7 71.614343
8 -3.817774
9 -10.283683
10 -3.804962
11 8.639403
12 18.927553
13 -10.004095
14 -37.431770
...
55 88.612179
56 18.011459
57 -7.421157
58 7.974531
59 -3.697444
60 -17.520367
61 36.453531
62 -39.235745
63 -6.701737
64 68.672902
65 -44.040923
66 11.826075
67 -21.995198
68 -15.886362
69 4.653335
Length: 70, dtype: float64
By the way, this only happens to mu. The precision variable remains the same.
Did I make a mistake?
UPDATE:
Already filed an issue on GitHub. After further inspection, the culprit seems to be the pd.Series object used in the mu variable. If I convert or remove the Series, mu no longer changes after calling MvNormal.
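For reference, a minimal sketch of that workaround (my reading of "convert the Series": pass plain numpy arrays into the deterministic via .values):
# Same deterministic as before, but with .values so mu holds a numpy
# array instead of a pd.Series (the workaround described above)
mu = pm.Lambda('mu', lambda b=beta:
               b[0] + b[1]*df['LivingArea'].values/1000.0 + b[2]*df['Age'].values)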
Thanks!
