Using statsmodels.seasonal_decompose() without DatetimeIndex but with Known Frequency - python

I have a time-series signal I would like to decompose in Python, so I turned to statsmodels.seasonal_decompose(). My data has frequency of 48 (half-hourly). I was getting the same error as this questioner, where the solution was to change from an Int index to a DatetimeIndex. But I don't know the actual dates/times my data is from.
In this GitHub thread, one of the statsmodels contributors says that
"In 0.8, you should be able to specify freq as keyword argument to override the index."
But this seems not to be the case for me. Here is a minimal code example illustrating my issue:
import pandas as pd
import statsmodels.api as sm
dta = pd.Series([x%3 for x in range(100)])
decomposed = sm.tsa.seasonal_decompose(dta, freq=3)
AttributeError: 'RangeIndex' object has no attribute 'inferred_freq'
Version info:
import statsmodels
print(statsmodels.__version__)
0.8.0
Is there a way to decompose a time-series in statsmodels with a specified frequency but without a DatetimeIndex?
If not, is there a preferred alternative for doing this in Python? I checked out the Seasonal package, but its github lists 0 downloads/month, one contributor, and last commit 9 months ago, so I'm not sure I want to rely on that for my project.

Thanks to josef-pkt for answering this on github. There is a bug in statsmodels 0.8.0 where it always attempts to calculate an inferred frequency based on a DatetimeIndex, if passed a Pandas object.
The workaround when using Pandas series is to pass their values in a numpy array to seasonal_decompose(). For example:
import pandas as pd
import statsmodels.api as sm
my_pandas_series = pd.Series([x%3 for x in range(100)])
decomposed = sm.tsa.seasonal_decompose(my_pandas_series.values, freq=3)
(no errors)
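To make concrete what a decomposition with a known period computes when there is no DatetimeIndex at all, here is a minimal numpy-only sketch of a classical additive decomposition. It is an illustration under assumptions, not statsmodels' implementation: the function name naive_decompose is made up for this example, the trend is a plain centered moving average, and the seasonal component is the mean of the detrended values at each position in the cycle.

```python
import numpy as np

def naive_decompose(values, period):
    """Hypothetical helper: additive decomposition with a known period."""
    x = np.asarray(values, dtype=float)
    # Centered moving average. For an even period, use period + 1 points
    # with half weights at both ends so the window stays centered.
    if period % 2 == 0:
        weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        weights = np.ones(period) / period
    half = len(weights) // 2
    trend = np.full(x.shape, np.nan)  # edges stay NaN (no full window there)
    trend[half:len(x) - half] = np.convolve(x, weights, mode="valid")
    detrended = x - trend
    # Mean of the detrended values at each position within the cycle.
    phase_means = np.array(
        [np.nanmean(detrended[i::period]) for i in range(period)]
    )
    phase_means -= phase_means.mean()  # seasonal effects sum to zero
    seasonal = np.tile(phase_means, len(x) // period + 1)[:len(x)]
    resid = x - trend - seasonal
    return trend, seasonal, resid

# Same toy signal as in the question: 0, 1, 2, 0, 1, 2, ...
signal = np.array([i % 3 for i in range(100)])
trend, seasonal, resid = naive_decompose(signal, period=3)
```

For half-hourly data with a daily cycle, the same call would use period=48. Only the integer position of each sample matters here, so no dates are needed.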

Related

Orange Python Script create custom timestamp (Orange Data Mining Windows 10)

I am trying to achieve a script, which will create an Orange data table with just a single column containing a custom time stamp.
Usecase: I need a complete time stamp so I can merge some other csv files later on. I'm working in the Orange GUI BTW and am not working in the actual python shell or any other IDE (in case this information makes any difference).
Here's what I have come up with so far:
from Orange.data import Domain, Table, TimeVariable
import numpy as np
domain = Domain([TimeVariable("Timestamp")])
# Timestamps from 2022-03-08 up to (not including) 2022-03-15, in minute steps
arr = np.arange("2022-03-08", "2022-03-15", dtype="datetime64[m]")
# Obviously necessary to achieve a correct format for the matrix
arr = arr.reshape(-1,1)
out_data = Table.from_numpy(domain, arr)
However the results do not match:
>>> print(arr)
[['2022-03-08T00:00']
['2022-03-08T00:01']
['2022-03-08T00:02']
...
['2022-03-14T23:57']
['2022-03-14T23:58']
['2022-03-14T23:59']]
>>> print(out_data)
[[27444960.0],
[27444961.0],
[27444962.0],
...
[27455037.0],
[27455038.0],
[27455039.0]]
Obviously I'm missing something when handing over the data from numpy but I'm having a real hard time trying to understand the documentation.
I've also found this post which seems to tackle a similar issue, but I haven't figured out how to apply the solution on my problem.
I would be really glad if anyone could help me out here. Please try to use simple terms and concepts.
Thank you for the question, and apologies for the weak documentation of the TimeVariable.
In your code, you must change two things to work.
First, it is necessary to set whether the TimeVariable includes time and/or date data:
TimeVariable("Timestamp", have_date=True) stores only date information -- it is analogous to datetime.date
TimeVariable("Timestamp", have_time=True) stores only time information (without date) -- it is analogous to datetime.time
TimeVariable("Timestamp", have_time=True, have_date=True) stores date and time -- it is analogous to datetime.datetime
You didn't set that information in your example, so both were False by default. For your case, you must set both to True since your attribute will hold the date-time values.
The other issue is that Orange's Table stores date-time values as UNIX epoch time (seconds since 1970-01-01), so Table.from_numpy expects values in that format. The values in your current arr array are in minutes instead, which is why the printed numbers are minute counts. In the code below I simply convert the dtype to seconds.
Here is the working code:
from Orange.data import Domain, Table, TimeVariable
import numpy as np
# Important: set whether TimeVariable contains time and/or date
domain = Domain([TimeVariable("Timestamp", have_time=True, have_date=True)])
# Timestamps from 2022-03-08 up to (not including) 2022-03-15, in minute steps
arr = np.arange("2022-03-08", "2022-03-15", dtype="datetime64[m]").astype("datetime64[s]")
# necessary to achieve a correct format for the matrix
arr = arr.reshape(-1,1)
out_data = Table.from_numpy(domain, arr)
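The unit mismatch can be checked with numpy alone, without Orange. The sketch below is only an illustration; the variable names are made up. It shows that the numbers printed in the question are minutes since 1970-01-01, and that casting to datetime64[s] gives the seconds-based values described above.

```python
import numpy as np

# Same range as in the question, at minute resolution.
minutes = np.arange("2022-03-08", "2022-03-15", dtype="datetime64[m]")

# The underlying integers count minutes since 1970-01-01.
as_minutes = minutes.astype("int64")

# After casting to second resolution, the integers count seconds instead.
as_seconds = minutes.astype("datetime64[s]").astype("int64").astype(float)

print(as_minutes[0], as_seconds[0])
```

The first value of as_minutes is the 27444960 seen in the question's output; multiplying it by 60 gives the first value of as_seconds.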

Why does not Seaborn Relplot print datetime value on x-axis?

I'm trying to solve a Kaggle Competition to get deeper into data science knowledge. I'm dealing with an issue with seaborn library. I'm trying to plot a distribution of a feature along the date but the relplot function is not able to print the datetime value. On the output, I see a big black box instead of values.
Here is my plotting code:
rainfall_types = list(auser.loc[:,1:])
grid = sns.relplot(x='Date', y=rainfall_types[0], kind="line", data=auser);
grid.fig.autofmt_xdate()
Here are the seaborn.relplot output and the head of my dataset.
I found the error. When you load the dataset with pandas.read_csv(dataset), any datetime columns are read as object dtype, so Python sees each value as a str (string). When you then plot them, matplotlib cannot treat them as dates and does not display them correctly.
To avoid this behaviour, have pandas convert the column to datetime objects when reading the file:
df = pandas.read_csv(dataset, parse_dates=['Column_Date'])
This tells pandas that the column named 'Column_Date' holds dates and must be converted to datetime objects.
If you want, you can also use the date column as the index of your DataFrame, to speed up analysis over time. To do so, add the argument index_col='Column_Date' to your read_csv call.
I hope you will find it helpful.
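A small self-contained check of the difference, assuming a toy CSV with made-up column names (Date, Rainfall) read from a string rather than from the Kaggle file:

```python
import io
import pandas as pd

# Toy data standing in for the real dataset (hypothetical columns).
csv_text = "Date,Rainfall\n2020-01-01,1.2\n2020-01-02,0.0\n2020-01-03,3.4\n"

# Without parse_dates, the Date column holds plain strings.
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, the Date column is converted to datetime64.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])

# Optionally use the parsed column as the index as well.
indexed = pd.read_csv(
    io.StringIO(csv_text), parse_dates=["Date"], index_col="Date"
)

print(raw["Date"].dtype, parsed["Date"].dtype, type(indexed.index).__name__)
```

An already loaded frame can be converted afterwards with pd.to_datetime(df["Date"]).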

how to convert datetime to numeric data type?

I have a dataset as
time MachineId
1530677359000000000 01081081
1530677363000000000 01081081
1530681023000000000 01081090
1530681053000000000 01081090
1530681531000000000 01081090
My code is:
import pandas as pd
from datetime import datetime
import time
import datetime
import matplotlib.pyplot as plt
import matplotlib.dates as mdate
df = pd.read_csv('acn.csv')
df['time'] = pd.to_datetime(df['time'], unit='ns')  # convert epoch nanoseconds to datetime
print(df.head())
Output:
time MachineId
0 2018-07-04 04:09:19 1081081.0
1 2018-07-04 04:09:23 1081081.0
2 2018-07-04 05:10:23 1081090.0
3 2018-07-04 05:10:53 1081090.0
4 2018-07-04 05:18:51 1081090.0
and now I want to change my data of time to numeric to generate a plot between time and machine id
dates = plt.dates.date2num(df['time'])
df.plot(kind='scatter',x='dates',y='MachineId')
plt.show()
which throws an error:
AttributeError: 'module' object has no attribute 'dates'
How can I change datetime format to numeric so that a plot can be formed ?
You got the following error:
AttributeError: 'module' object has no attribute 'dates'
Your error message is telling you that matplotlib.pyplot.dates (plt.dates) doesn't exist: the pyplot module, which you imported as plt, has no attribute named 'dates'.
So you need to fix that error before you worry about converting anything. Did you mean to call matplotlib.dates.date2num instead? In your code you have the following:
import matplotlib.dates as mdate
So maybe you meant to call mdate.date2num instead? That should eliminate the AttributeError.
If that doesn't work for you, you could try what is suggested in the link provided by one of the other commenters, to use pandas to_pydatetime. I'm not familiar with it, but in this example page, it is accessed as Series.dt.to_pydatetime()
All of this converting is just necessary because you are trying to use df.plot; maybe you should consider just calling matplotlib directly. For example, could you just use plt.plot_date instead? (here's the link to it). Pandas is excellent, but the plotting interface isn't as mature as the rest of it. As an example (I'm not saying this is the exact problem you are having) but here is a known bug in pandas regarding plotting dates. Here is an older stack overflow thread where someone stubs out a plt.plot_date method for you.
You can also plot dates directly. For example, if you want the date on the x-axis, pass the datetime column straight to matplotlib, as in ax.plot(df.time, ids). I think this might be the closest thing to what you are looking for.
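If a plain numeric column is still wanted, it can be computed with pandas alone. The sketch below uses the first rows from the question and assumes that days (or seconds) since 1970-01-01 is the numeric form needed; as far as I know, recent matplotlib versions (3.3 and later) use that same date as the default epoch for date2num, but treat that as an assumption and check your version.

```python
import pandas as pd

# First rows of the question's data; MachineId kept as text to preserve zeros.
df = pd.DataFrame({
    "time": [1530677359000000000, 1530677363000000000, 1530681023000000000],
    "MachineId": ["01081081", "01081081", "01081090"],
})

# Epoch nanoseconds -> datetime64.
df["time"] = pd.to_datetime(df["time"], unit="ns")

# datetime64 -> numbers, measured from 1970-01-01.
epoch = pd.Timestamp("1970-01-01")
df["days"] = (df["time"] - epoch) / pd.Timedelta(days=1)        # float days
df["seconds"] = (df["time"] - epoch) // pd.Timedelta(seconds=1)  # whole seconds

print(df)
```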

Python matplotlib.dates.date2num: converting numpy array to matplotlib datetimes

I am trying to plot a custom chart with datetime axis. My understanding is that matplotlib requires a float format which is days since epoch. So, I want to convert a numpy array to the float epoch as required by matplotlib.
The datetime values are stored in a numpy array called t:
In [235]: t
Out[235]: array(['2008-12-01T00:00:59.000000000-0800',
'2008-12-01T00:00:59.000000000-0800',
'2008-12-01T00:00:59.000000000-0800',
'2008-12-01T00:09:26.000000000-0800',
'2008-12-01T00:09:41.000000000-0800'], dtype='datetime64[ns]')
Apparently, matplotlib.dates.date2num only accepts a sequence of python datetimes as input (not numpy datetimes arrays):
import matplotlib.dates as dates
plt_dates = dates.date2num(t)
raises AttributeError: 'numpy.datetime64' object has no attribute 'toordinal'
How should I resolve this issue? I hope to have a solution that works for all types of numpy.datetime like object.
My best workaround (which I am not sure to be correct) is not to use date2num at all. Instead, I try to use the following:
z = np.array([0]).astype(t.dtype)
plt_dates = (t - z)/ np.timedelta64(1,'D')
Even if this solution is correct, it would be nicer to use library functions instead of manual ad hoc workarounds.
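For a quick fix, use the following (to_pydatetime() is a pandas method, so the numpy array is wrapped in a DatetimeIndex first):
import matplotlib.dates as dates
import pandas as pd
plt_dates = dates.date2num(pd.DatetimeIndex(t).to_pydatetime())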
or:
import matplotlib.dates as dates
plt_dates = dates.date2num(list(t))
It seems the latest (matplotlib.__version__ '2.1.0') does not like numpy arrays... Edit: In my case, after checking the source code, the problem seems to be that the latest matplotlib.cbook cannot create an iterable from the numpy array and thinks the array is a number.
For similar but a bit more complex problems, check http://stackoverflow.com/questions/13703720/converting-between-datetime-timestamp-and-datetime64, possibly Why do I get "python int too large to convert to C long" errors when I use matplotlib's DateFormatter to format dates on the x axis?, and maybe matplotlib plot_date AttributeError: 'numpy.datetime64' object has no attribute 'toordinal' (if someone answers)
Edit: someone answered, his code using to_pydatetime() seems best, also: pandas 0.21.0 Timestamp compatibility issue with matplotlib, though that did not work in my case (because of python 2???)
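For reference, both conversions discussed in this thread can be done with numpy alone. This is a sketch under assumptions: the sample timestamps are the question's values written in UTC, and the float result only lines up with date2num if matplotlib's epoch is 1970-01-01 (older releases counted days from year 0001 instead, so the numbers differ there).

```python
import datetime
import numpy as np

# Two of the question's timestamps, expressed in UTC.
t = np.array(
    ["2008-12-01T08:00:59", "2008-12-01T08:09:26"], dtype="datetime64[ns]"
)

# Float days since 1970-01-01, the manual workaround from the question.
epoch = np.datetime64("1970-01-01T00:00:00", "ns")
days = (t - epoch) / np.timedelta64(1, "D")

# Python datetime objects: go through microsecond resolution first,
# because nanosecond values convert to plain integers instead.
py_datetimes = t.astype("datetime64[us]").tolist()

print(days, py_datetimes)
```

The list of datetime.datetime objects is the kind of input that the question says date2num accepts.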

statsmodels AR model error when calling params

New to statsmodels, trying to use statsmodels.tsa.ar_model to fit a pandas timeseries.
#pull one series from dataframe
y=data.sentiment
armodel=sm.tsa.ar_model.AR(y, freq='D').fit()
armodel.params()
gets the following error:
C:\Python27\lib\site-packages\pandas\lib.pyd in pandas.lib.SeriesIndex.__set__ (pandas\lib.c:27817)()
AssertionError: Index length did not match values
Any ideas?
You should upgrade to current master, if you can. This was fixed here.
