Error when trying to use seasonal_decompose from Statsmodels - python

I have the following dataframe called 'data':
Month        Revenue Index
1920-01-01   1.72
1920-02-01   1.83
1920-03-01   1.94
...          ...
2021-10-01   114.20
2021-11-01   115.94
2021-12-01   116.01
This is essentially a monthly revenue index on which I am trying to use seasonal_decompose with the following code:
result = seasonal_decompose(data['Revenue Index'], model='multiplicative')
But unfortunately I get the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-39-08e3139bbf77> in <module>()
----> 1 result = seasonal_decompose(data['Consumptieprijsindex'], model='multiplicative')
2 rcParams['figure.figsize'] = 12, 6
3 plt.rc('lines', linewidth=1, color='r')
4
5 fig = result.plot()
/usr/local/lib/python3.7/dist-packages/statsmodels/tsa/seasonal.py in seasonal_decompose(x, model, filt, freq, two_sided, extrapolate_trend)
125 freq = pfreq
126 else:
--> 127 raise ValueError("You must specify a freq or x must be a "
128 "pandas object with a timeseries index with "
129 "a freq not set to None")
ValueError: You must specify a freq or x must be a pandas object with a timeseries index with a freq not set to None
Does anyone know how to solve this issue? Thanks!

The following code in the comments answered my question:
result = seasonal_decompose(data['Revenue Index'], model='multiplicative', period=12)
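The error message also points at a second way out: giving the index itself a frequency, so the period does not have to be passed by hand. The following is a minimal sketch of that idea using only pandas; the data is synthetic and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the question's monthly index (values are made up).
months = pd.date_range("1920-01-01", periods=48, freq="MS")
values = np.linspace(1.7, 5.0, len(months))

# An index built from a plain list of timestamps (as after parsing a CSV)
# carries no frequency information, which is what triggers the ValueError.
no_freq_index = pd.DatetimeIndex(list(months))
series = pd.Series(values, index=no_freq_index, name="Revenue Index")
print(series.index.freq)

# asfreq("MS") conforms the series to a month-start grid and attaches
# that frequency to the index.
series = series.asfreq("MS")
print(series.index.freqstr)
```

With a month-start frequency on the index, the call from the question should be able to work out the period on its own. Note that the traceback above lists `freq` rather than `period` in the function signature, so on that older statsmodels version the keyword would probably have to be `freq=12`.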

Related

Why am I getting a ValueError during sentiment analysis?

I was trying to do sentiment analysis of Amazon product reviews and wanted to draw a pie chart and a bar graph, but I got the error below and neither chart is displayed.
ValueError Traceback (most recent call last)
<ipython-input-90-2089ce8a5ab8> in <module>
----> 1 categorical_variable_summary(df,"overall")
1 frames
<ipython-input-87-29535c4328ba> in categorical_variable_summary(df, column_name)
3 fig = make_subplots(rows = 1, cols = 2,
4 subplot_titles=('Countplot', 'Percentage'),
----> 5 specs=[[{'types' : 'xy'}],[{'types': 'domain'}]])
6
7 fig.add_trace(go.Bar( y = df[column_name].value_counts().values.tolist(),
/usr/local/lib/python3.7/dist-packages/plotly/subplots.py in make_subplots(rows, cols, shared_xaxes, shared_yaxes, start_cell, print_grid, horizontal_spacing, vertical_spacing, subplot_titles, column_widths, row_heights, specs, insets, column_titles, row_titles, x_title, y_title, figure, **kwargs)
448 dimensions ({rows} x {cols}).
449 Received value of type {typ}: {val}""".format(
--> 450 rows=rows, cols=cols, typ=type(specs), val=repr(specs)
451 )
452 )
ValueError:
The 'specs' argument to make_subplots must be a 2D list of dictionaries with dimensions (1 x 2).
Received value of type <class 'list'>: [[{'types': 'xy'}], [{'types': 'domain'}]]
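Going by the error message, specs has to be a list with one inner list per row and one dict per column, so a 1 x 2 grid needs a single inner list holding two dicts. The sketch below is one way to write that; the helper names specs_shape and categorical_summary_figure are invented for illustration, and it assumes the dict key is meant to be "type" rather than "types". plotly is imported inside the function so that the shape check can run on its own.

```python
ROWS, COLS = 1, 2

# One inner list per row, one dict per column.
SPECS = [[{"type": "xy"}, {"type": "domain"}]]


def specs_shape(specs):
    """Return (rows, cols) of a 2D specs list; raise if the rows are ragged."""
    widths = {len(row) for row in specs}
    if len(widths) != 1:
        raise ValueError("every row of specs needs the same number of cells")
    return len(specs), widths.pop()


def categorical_summary_figure(labels, counts):
    """Bar chart on the left, pie chart on the right (hypothetical helper)."""
    from plotly.subplots import make_subplots
    import plotly.graph_objects as go

    fig = make_subplots(
        rows=ROWS,
        cols=COLS,
        subplot_titles=("Countplot", "Percentage"),
        specs=SPECS,
    )
    fig.add_trace(go.Bar(x=labels, y=counts), row=1, col=1)
    fig.add_trace(go.Pie(labels=labels, values=counts), row=1, col=2)
    return fig


# The layout from the question is 2 x 1, which does not match rows=1, cols=2.
print(specs_shape([[{"types": "xy"}], [{"types": "domain"}]]))
print(specs_shape(SPECS))
```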

Create non-standard pandas frequency ("dekads" = 3 periods per month)

I want to create a new frequency to assign to a pandas.DatetimeIndex. This is a dekad frequency, with 36 periods in a year, three per month. The first is always on the 10th day of the month, the second on the 20th, and the third on the final day of that month.
The difficulty is that the final day of the month varies: it depends on the number of days in the month (28, 29, 30 or 31), and in February on whether it is a leap year (28th or 29th).
Ultimately, however, it is a set frequency (3 per month, 36 periods per year).
The reason is that statsmodels.tsa.holtwinters models require indexes with a given frequency to make forecasts. When I try to run the holtwinters forecast I get the following warning message:
/home/tommy/miniconda3/envs/ml/lib/python3.8/site-packages/statsmodels/tsa/base/tsa_model.py:216: ValueWarning: A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
This is what dekad timesteps look like:
import numpy as np
import pandas as pd
from pandas.tseries.offsets import MonthEnd
dates = pd.date_range("2000-01-01", "2003-01-01")
_dekads = [d for d in dates if d.day in [10, 20]]
_month_ends = [d + MonthEnd(1) for d in dates if d.day == 10]
dekads = sorted(np.concatenate([_dekads, _month_ends]))
I want to be able to assign a dekad frequency to the index:
df = pd.DataFrame({"y": np.random.random(len(dekads))}, index=dekads)
df.head()
Out[]:
                   y
2000-01-10  0.013236
2000-01-20  0.430563
2000-01-31  0.028183
2000-02-10  0.050080
2000-02-20  0.092100
I'd like to be able to assign a "dekad" frequency to the object. How can I create my own dekad frequency?
df.index.freq = "dekad"
Out[]:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
pandas/_libs/tslibs/offsets.pyx in pandas._libs.tslibs.offsets._get_offset()
KeyError: 'DEKAD'
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
pandas/_libs/tslibs/offsets.pyx in pandas._libs.tslibs.offsets.to_offset()
pandas/_libs/tslibs/offsets.pyx in pandas._libs.tslibs.offsets._get_offset()
ValueError: Invalid frequency: DEKAD
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
<ipython-input-155-aa7b4737fd5a> in <module>
7
8 df = pd.DataFrame({"y": np.random.random(len(dekads))}, index=dekads)
----> 9 df.index.freq = "dekad"
~/miniconda3/envs/ml/lib/python3.8/site-packages/pandas/core/indexes/extension.py in fset(self, value)
62
63 def fset(self, value):
---> 64 setattr(self._data, name, value)
65
66 fget.__name__ = name
~/miniconda3/envs/ml/lib/python3.8/site-packages/pandas/core/arrays/datetimelike.py in freq(self, value)
1090 def freq(self, value):
1091 if value is not None:
-> 1092 value = to_offset(value)
1093 self._validate_frequency(self, value)
1094
pandas/_libs/tslibs/offsets.pyx in pandas._libs.tslibs.offsets.to_offset()
pandas/_libs/tslibs/offsets.pyx in pandas._libs.tslibs.offsets.to_offset()
ValueError: Invalid frequency: dekad
# How can I create a new freq object in pandas
The purpose of this exercise:
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv(
    "https://gist.githubusercontent.com/tommylees112/2b1b2dda43d91ea9346a6edaa6788ec8/raw/644af74955ce078d1c4d55a2ffd6a55eeb59bad4/demo_data_SO_02092021.csv"
).astype({"time": "datetime64[ns]"}).set_index("time")
train, test = df.iloc[:-100], df.iloc[-100:]
f, ax = plt.subplots(figsize=(12, 4))
ax.plot(train, label="train")
ax.plot(test, label="test")
plt.xticks(rotation=70)
plt.legend()
from statsmodels.tsa.holtwinters import SimpleExpSmoothing, ExponentialSmoothing
# set seasonality parameters
m = 36
alpha = 1/(2*m)
model = ExponentialSmoothing(train["vci"],trend="mul").fit()
preds = model.forecast(len(test))
preds.index = test.index
f, ax = plt.subplots(figsize=(12, 4))
ax.plot(train.index, model.fittedvalues, label="Train Predictions")
ax.plot(test.index, preds, label="Test Predictions")
ax.plot(df.index, df["vci"], ls="--", color="k", alpha=0.6)
plt.xticks(rotation=70)
plt.legend()
This forecast is clearly poor and does not reflect the learned seasonality. I believe this is because no frequency has been assigned to the datetime index.
If there are alternative methods for achieving these goals then I would be very keen to explore those options. I want to create a new frequency to assign to a pandas.DatetimeIndex. The reason is that statsmodels.tsa models require indexes with a given frequency to make forecasts.
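For reference, the dekad dates described above can also be generated per month rather than by filtering a daily range. This is only a sketch of the date construction, not a custom pandas frequency; dekad_index is a made-up helper name.

```python
import pandas as pd


def dekad_index(start, end):
    """Return the 10th, 20th and last day of every month from start to end."""
    days = []
    for first in pd.date_range(start, end, freq="MS"):
        days.append(first + pd.Timedelta(days=9))    # 10th of the month
        days.append(first + pd.Timedelta(days=19))   # 20th of the month
        days.append(first + pd.offsets.MonthEnd(0))  # last day of the month
    return pd.DatetimeIndex(days)


idx = dekad_index("2000-01-01", "2002-12-31")
print(len(idx))   # 3 years x 36 dekads
print(idx[:6])
print(idx.freq)   # the spacing is irregular, so no frequency is attached
```

Since the spacing between these dates is not constant, the index has no frequency. One possible workaround, untested here, is to give the model the season length directly: the question's code computes m = 36 but never passes it, and ExponentialSmoothing accepts seasonal and seasonal_periods arguments.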

How to create a DataFrame in Pandas

I am using playerStats.csv, which includes 8 columns, of which I only need 2. So I'm trying to create a new DataFrame with only those 2 columns.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
dataset = pd.read_csv("HLTVData/playerStats.csv")
dataset.head(20)
I only need the ADR and the Rating.
So I first create a matrix with the data set.
mat = dataset.as_matrix()
#4 is the ADR and 6 is the Rating
newDataSet = pd.DataFrame(dataset, index=indexMatrix,columns=(mat[:,4],mat[:,6]) )
But it didn't work; it threw an exception:
NameError Traceback (most recent call last)
<ipython-input-10-1f975cc2514a> in <module>()
1 #4 is the ADR and 6 is the Rating
----> 2 newDataSet = pd.DataFrame(dataset, index=indexMatrix,columns=(mat[:,4],mat[:,6]) )
NameError: name 'indexMatrix' is not defined
I also tried using the dataset.
newDataSet = pd.DataFrame(dataset, index=np.array(range(dataset.shape[0])), columns=dataset['ADR'])
/home/tensor/miniconda3/envs/tensorflow35openvc/lib/python3.5/site-packages/pandas/core/internals.py in _make_na_block(self, placement, fill_value)
3984
3985 dtype, fill_value = infer_dtype_from_scalar(fill_value)
-> 3986 block_values = np.empty(block_shape, dtype=dtype)
3987 block_values.fill(fill_value)
3988 return make_block(block_values, placement=placement)
MemoryError:
I think you need the usecols parameter of read_csv:
dataset = pd.read_csv("HLTVData/playerStats.csv", usecols=['ADR','Rating'])
Or:
dataset = pd.read_csv("HLTVData/playerStats.csv", usecols=[4,6])
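If the full file has already been loaded, the two columns can also be picked out of the existing DataFrame by name. A small sketch with made-up sample data (only the ADR and Rating column names come from the question):

```python
import io

import pandas as pd

# Made-up stand-in for playerStats.csv; only ADR and Rating are real names.
csv_text = (
    "Name,ADR,Rating\n"
    "player_a,80.1,1.10\n"
    "player_b,75.4,1.02\n"
)
dataset = pd.read_csv(io.StringIO(csv_text))

# Selecting with a list of column names returns a new two-column DataFrame.
new_dataset = dataset[["ADR", "Rating"]].copy()
print(new_dataset)
```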

Having trouble with seaborn module in python

I am trying to draw some basic plots using seaborn's jointplot() method.
My pandas data frame looks like this:
Out[250]:
YEAR Yields avgSumPcpn avgMaxSumTemp avgMinSumTemp
1970 5000 133.924981 30.437124 19.026974
1971 5560 107.691316 31.161974 19.278186
1972 5196 116.830066 31.454192 19.443712
1973 4233 181.550733 30.373581 19.097679
1975 5093 112.137538 30.428966 18.863224
I am trying to draw 'Yields' against 'YEAR' (So a plot to see how 'Yields' is varying over time). A simple plot.
But when I do this:
sns.jointplot(x='YEAR',y='Yeilds', data = summer_pcpn_temp_yeild, kind = 'reg', size = 10)
I am getting the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-251-587582a746b8> in <module>()
3 #ax = plt.axes()
4 #sns_sum_reg_min_temp_pcpn = sns.regplot(x='avgSumPcpn',y='avgMaxSumTemp', data = df_sum_temp_pcpn)
----> 5 sns.jointplot(x='Yeilds',y='YEAR', data = summer_pcpn_temp_yeild, kind = 'reg', size = 10)
6 plt.title('Avg Summer Precipitation vs Yields of Wharton TX', fontsize = 10)
7
//anaconda/lib/python2.7/site-packages/seaborn/distributions.pyc in jointplot(x, y, data, kind, stat_func, color, size, ratio, space, dropna, xlim, ylim, joint_kws, marginal_kws, annot_kws, **kwargs)
793 grid = JointGrid(x, y, data, dropna=dropna,
794 size=size, ratio=ratio, space=space,
--> 795 xlim=xlim, ylim=ylim)
796
797 # Plot the data using the grid
//anaconda/lib/python2.7/site-packages/seaborn/axisgrid.pyc in __init__(self, x, y, data, size, ratio, space, dropna, xlim, ylim)
1637 if dropna:
1638 not_na = pd.notnull(x) & pd.notnull(y)
-> 1639 x = x[not_na]
1640 y = y[not_na]
1641
TypeError: string indices must be integers, not Series
So I printed out the types of each column. Here is how:
for i in summer_pcpn_temp_yeild.columns.values.tolist():
    print type(summer_pcpn_temp_yeild[[i]])
print type(summer_pcpn_temp_yeild.index.values)
which gives me:
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
<type 'numpy.ndarray'>
So I am not able to understand how to fix it.
Any help would be greatly appreciated.
Thanks
Check that YEAR and Yields hold numeric (not string) values.
Try changing 'Yeilds' to 'Yields' in your call to jointplot:
sns.jointplot(x='YEAR', y='Yields', data=summer_pcpn_temp_yeild, kind='reg', size=10)
The error message is misleading. Seaborn can't find the column named "Yeilds" in your summer_pcpn_temp_yeild dataframe, because the dataframe column is spelled "Yields".
I had the same problem, and fixed it by correcting the x= argument to sns.jointplot()
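Because the error message does not name the missing column, a quick check of the column names before plotting can catch this kind of typo. A minimal sketch, where missing_columns is a made-up helper and the frame is a two-row excerpt of the data shown in the question:

```python
import pandas as pd


def missing_columns(df, *names):
    """Return the requested names that are not columns of df."""
    return [name for name in names if name not in df.columns]


df = pd.DataFrame({"YEAR": [1970, 1971], "Yields": [5000, 5560]})

# The misspelled name is reported; the correctly spelled ones are not.
print(missing_columns(df, "YEAR", "Yeilds"))
print(missing_columns(df, "YEAR", "Yields"))
```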

Can pandas groupby transform a DataFrame into a Series?

I would like to use pandas and statsmodels to fit a linear model on subsets of a dataframe and return the predicted values. However, I am having trouble figuring out the right pandas idiom to use. Here is what I am trying to do:
import pandas as pd
import statsmodels.formula.api as sm
import seaborn as sns
tips = sns.load_dataset("tips")
def fit_predict(df):
    m = sm.ols("tip ~ total_bill", df).fit()
    return pd.Series(m.predict(df), index=df.index)
tips["predicted_tip"] = tips.groupby("day").transform(fit_predict)
This raises the following error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-139-b3d2575e2def> in <module>()
----> 1 tips["predicted_tip"] = tips.groupby("day").transform(fit_predict)
/Users/mwaskom/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in transform(self, func, *args, **kwargs)
3033 return self._transform_general(func, *args, **kwargs)
3034 except:
-> 3035 return self._transform_general(func, *args, **kwargs)
3036
3037 # a reduction transform
/Users/mwaskom/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _transform_general(self, func, *args, **kwargs)
2988 group.T.values[:] = res
2989 else:
-> 2990 group.values[:] = res
2991
2992 applied.append(group)
ValueError: could not broadcast input array from shape (62) into shape (62,6)
The error makes sense in that I think .transform wants to map a DataFrame to a DataFrame. But is there a way to do a groupby operation on a DataFrame, pass each chunk into a function that reduces it to a Series (with the same index), and then combine the resulting Series into something that can be inserted into the original dataframe?
The top part here is the same, I'm just using a toy dataset b/c I'm behind a firewall.
tips = pd.DataFrame({ 'day':list('MMMFFF'), 'tip':range(6),
                      'total_bill':[10,40,20,80,50,40] })
def fit_predict(df):
    m = sm.ols("tip ~ total_bill", df).fit()
    return pd.Series(m.predict(df), index=df.index)
If you change 'transform' to 'apply', you'll get:
tips.groupby("day").apply(fit_predict)
day
F    3    2.923077
     4    4.307692
     5    4.769231
M    0    0.714286
     1    1.357143
     2    0.928571
That's not quite what you want, but if you drop level=0, you can proceed as desired:
tips['predicted'] = tips.groupby("day").apply(fit_predict).reset_index(level=0,drop=True)
  day  tip  total_bill  predicted
0   M    0          10   0.714286
1   M    1          40   1.357143
2   M    2          20   0.928571
3   F    3          80   2.923077
4   F    4          50   4.307692
5   F    5          40   4.769231
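The same result can be reached without reset_index by looping over the groups and concatenating the per-group Series, since each one keeps its original row labels. The sketch below is self-contained and swaps sm.ols for a degree-1 np.polyfit so that it runs without statsmodels; that substitution is an assumption made for the example, not part of the answer above.

```python
import numpy as np
import pandas as pd

tips = pd.DataFrame({
    "day": list("MMMFFF"),
    "tip": range(6),
    "total_bill": [10, 40, 20, 80, 50, 40],
})


def fit_predict(df):
    # Straight-line least-squares fit of tip on total_bill (stand-in for sm.ols).
    x = df["total_bill"].to_numpy(dtype=float)
    y = df["tip"].to_numpy(dtype=float)
    coefs = np.polyfit(x, y, 1)
    # Keep the group's own index so the result can be aligned later.
    return pd.Series(np.polyval(coefs, x), index=df.index)


# One Series per group; concat stacks them and assignment aligns on the index.
pieces = [fit_predict(group) for _, group in tips.groupby("day")]
tips["predicted"] = pd.concat(pieces)
print(tips)
```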
