How to create 'dates' column for using statsmodels.tsa.arima_model - python

The pandas DataFrame df contains the following info:
df
Date Param1
0 2011-12-21 55.12
1 2011-12-28 70.12
2 2012-01-04 33.73
3 2012-01-11 53.14
4 2012-01-18 74.16
To be able to use ARIMA I convert dates as follows:
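from statsmodels.tsa.base.datetools import dates_from_range
from statsmodels.tsa.arima_model import ARIMA
# Generates monthly dates starting at January 2011, one per row
dates = dates_from_range('2011m1', length=len(df.Param1))
df.index = dates
df = df[['Param1']]
res = ARIMA(df, (1,0,0), freq='M').fit()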
However, instead of the correct sequence of dates, I get this:
Date Param1
2011-01-31 55.12
2011-02-28 70.12
2011-03-31 33.73
2011-04-30 53.14
2011-05-31 74.16
How can I avoid using dates_from_range and construct the dates manually, so that the index keeps the dates already in df?
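One possible approach is to reuse the Date column that is already in df rather than generating new dates. The snippet below is a minimal sketch, assuming the Date column holds strings like those shown above. The name `series` is introduced only for illustration, and the ARIMA call is left out so the example runs without statsmodels.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Date": ["2011-12-21", "2011-12-28", "2012-01-04", "2012-01-11", "2012-01-18"],
    "Param1": [55.12, 70.12, 33.73, 53.14, 74.16],
})

# Parse the existing Date column and use it as the index
series = df.set_index(pd.to_datetime(df["Date"]))["Param1"]

# Let pandas work out the spacing of the dates (needs at least three timestamps)
freq = pd.infer_freq(series.index)

# Attach that frequency to the index so time series models can read it
series = series.asfreq(freq)

print(series)
print(series.index.freqstr)
```

The resulting `series` keeps the original weekly dates, and it could then be passed to the model in place of the monthly-indexed df. Note that newer statsmodels releases provide ARIMA under statsmodels.tsa.arima.model rather than statsmodels.tsa.arima_model.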

Related

Is there a way to create a dataframe in python from start date to end date

Is there a way to create a dataframe (using pandas) from a start date to an end date with random values?
For example, the required data frame:
date value
2015-06-25 12
2015-06-26 23
2015-06-27 3.4
2015-06-28 5.6
In this dataframe I need to set a "from date" and a "to date" so that the dates do not have to be typed manually, because the number of rows keeps increasing. The "value" column should hold randomly generated values.
You can use pandas.date_range and numpy.random.uniform:
import numpy as np
import pandas as pd
# One row per day between the two dates, both ends included
dates = pd.date_range('2015-06-25', '2015-06-28', freq='D')
df = pd.DataFrame({'date': dates,
'value': np.random.uniform(0, 50, size=len(dates))})
output:
date value
0 2015-06-25 39.496971
1 2015-06-26 29.947877
2 2015-06-27 7.328549
3 2015-06-28 6.738427
import pandas as pd
import numpy as np
beg = '2015-06-25'
end = '2015-06-28'
index = pd.date_range(beg, end)
values = np.random.rand(len(index))
df = pd.DataFrame(values, index=index)
output:
index 0
2015-06-25 00:00:00 0.865455754057059
2015-06-26 00:00:00 0.959574113247196
2015-06-27 00:00:00 0.8634928651970131
2015-06-28 00:00:00 0.6448579391510935
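Both answers above produce different numbers on every run. If reproducible values are wanted, the same idea can be wrapped in a small helper with a seed. This is only a sketch: the function name `random_frame` and its parameters `low`, `high` and `seed` are hypothetical and not part of pandas or NumPy.

```python
import numpy as np
import pandas as pd

def random_frame(start, end, low=0.0, high=50.0, seed=None):
    """Build a frame with one row per day from start to end (inclusive)."""
    dates = pd.date_range(start, end, freq="D")
    # A seeded generator gives the same values on every run
    rng = np.random.default_rng(seed)
    values = rng.uniform(low, high, size=len(dates))
    return pd.DataFrame({"date": dates, "value": values})

df = random_frame("2015-06-25", "2015-06-28", seed=0)
print(df)
```

Changing only the end date grows the frame, so no rows have to be typed by hand.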

Pandas - Add mean, max, min as row in dataframe

The dataframe eventually gets converted to Excel.
I am trying to create an additional row with the avg and max above each column.
I do not want to disturb the original headers for the actual data.
I don't want to hard-code column names, as these will change, so I need something more abstract. I attempted to create a max but failed. I need the max above the column headers.
Try this. I don't know how to put the rows above the dataframe, but adding them at the end might be a good enough solution:
import pandas as pd
df = {
'date and time':['2022-03-01', '2022-03-02', '2022-03-03', '2022-03-04'],
'<PowerAC--->':[40, 20, 9, 12]
}
df = pd.DataFrame(df)
cols = ['<PowerAC--->']
agg = df[cols].agg(['mean', 'max'])
out = pd.concat([df, agg])
print(out)
A one-liner which also removes the "NaN" values to make it visually better (I'm a bit OCD ;)). DataFrame.append is no longer available in recent pandas versions, so this uses pd.concat:
pd.concat([df, df.agg({'<PowerAC--->': ['mean', 'max']})]).fillna('')
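The question asks not to hard-code column names. One way to get there is to select the columns by dtype. The snippet below is a sketch, assuming that "every numeric column" is the selection the asker wants. The names `numeric_cols` and `stats` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'date and time': ['2022-03-01', '2022-03-02', '2022-03-03', '2022-03-04'],
    '<PowerAC--->': [40, 20, 9, 12],
})

# Pick up whatever numeric columns exist instead of naming them
numeric_cols = df.select_dtypes(include='number').columns

# One row per statistic, one column per numeric column
stats = df[numeric_cols].agg(['mean', 'max'])

# Put the statistics first, keeping the original column order;
# non-numeric columns are left as NaN in the statistic rows
out = pd.concat([stats.reindex(columns=df.columns), df])
print(out)
```

If the column names change, only the input data changes and the rest of the code stays the same.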
I would say it's a good idea to keep your data separated from the reporting on it - I don't really see the logic for an "additional row above the column".
I would generate statistics for the overall data as a separate dataframe.
import pandas as pd
import numpy as np
np.random.seed(1)
t = pd.date_range(start='2022-05-31', end='2022-06-07')
x = np.random.rand(len(t))
df = pd.DataFrame({'date': t, 'data': x})
print(df)
# The 'numeric_only=False' behaviour will become default in a future version of pandas
d_mean = df.mean(numeric_only=False)
d_max = df.max()
# We need to transpose this, as the `d_mean` and `d_max` are Series (columns), and we want them as rows
df_stats = pd.DataFrame({'mean': d_mean, 'max':d_max}).transpose()
print(df_stats)
df output:
date data
0 2022-05-31 0.417022
1 2022-06-01 0.720324
2 2022-06-02 0.000114
3 2022-06-03 0.302333
4 2022-06-04 0.146756
5 2022-06-05 0.092339
6 2022-06-06 0.186260
7 2022-06-07 0.345561
df_stats output:
date data
mean 2022-06-03 12:00:00 0.276339
max 2022-06-07 00:00:00 0.720324
You could add this and the dataframe together with
pd.concat([df_stats, df])
which looks like
date data
mean 2022-06-03 12:00:00 0.276339
max 2022-06-07 00:00:00 0.720324
0 2022-05-31 00:00:00 0.417022
1 2022-06-01 00:00:00 0.720324
2 2022-06-02 00:00:00 0.000114
3 2022-06-03 00:00:00 0.302333
4 2022-06-04 00:00:00 0.146756
5 2022-06-05 00:00:00 0.092339
6 2022-06-06 00:00:00 0.18626
7 2022-06-07 00:00:00 0.345561
but I would keep them separate unless you've got a very good reason to.
There may be some way which makes sense using a multi-index, but that's a bit beyond me, and probably beyond the scope of this question.
Edit: If the max and mean of the date column carry no meaning for you, but you still want something compatible with that column (i.e. still a datetime, but effectively null), you could replace them with np.datetime64('NaT') (NaT is similar to NaN, but means "not a time"):
df_stats['date'] = np.datetime64('NaT')
print(pd.concat([df_stats, df]).head())
output:
date data
mean NaT 0.276339
max NaT 0.720324
0 2022-05-31 0.417022
1 2022-06-01 0.720324
2 2022-06-02 0.000114

Pandas: Average values in groups of X days

I want to average the value of AAPL.High in groups of 10 days (e.g. Jan 1 to Jan 10), using day number 10 as the reference.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/finance-charts-apple.csv')
The idea is roughly the following (taken from another dataframe, df1):
df1['demand'] = df1.groupby(['supplier_name', 'date'])['difference'].transform('mean').fillna(0)
This is a simple case: set the dates as the index, then use resample():
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/finance-charts-apple.csv')
# resample() needs a datetime index
df.index = pd.to_datetime(df["Date"])
# Mean of AAPL.High over consecutive 10-day bins
df.resample("10D").agg({"AAPL.High": "mean"})
output
AAPL.High
Date
2015-02-17 130.657501
2015-02-27 129.675001
2015-03-09 126.661251
2015-03-19 127.134283
2015-03-29 126.533333
... ...
2017-01-07 119.532001
2017-01-17 120.841248
2017-01-27 125.740000
2017-02-06 133.172500
2017-02-16 135.899994
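To see what the 10-day bins contain without downloading the CSV, here is a small sketch on synthetic data. The values 0 to 29 are made up for illustration and only the column name is taken from the question.

```python
import numpy as np
import pandas as pd

# 30 consecutive days with values 0..29, so the bin means are easy to check by hand
idx = pd.date_range("2015-01-01", periods=30, freq="D")
df = pd.DataFrame({"AAPL.High": np.arange(30, dtype=float)}, index=idx)

# Each bin covers 10 calendar days, starting at the first date in the index
ten_day_mean = df["AAPL.High"].resample("10D").mean()
print(ten_day_mean)
```

Each bin is labeled with its first day here. resample() also accepts `label` and `closed` arguments if the bins should be labeled by their end instead.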

pandas period_range - gaining access to the

pd.period_range(start='2017-01-01', end='2017-01-01', freq='Q') gives me the following:
PeriodIndex(['2017Q1'], dtype='period[Q-DEC]', freq='Q-DEC')
I would like to gain access to '2017Q1' to put it into a different column in the dataframe.
I have a date column with dates, e.g., 1/1/2017. I have other column where I'd like to put the string calculated by the period range. It seems like that would be an efficient way to update the column with the fiscal quarter. I can't seem to access it. It isn't subscriptable, and I can't get at it even when I assign it to a variable. Just wondering what I'm missing.
pandas.PeriodIndex:
You were almost there. The result just needs to be assigned to a column:
import numpy as np
from datetime import timedelta, datetime
import pandas as pd
# list of dates - test data
first_date = datetime(2017, 1, 1)
last_date = datetime(2019, 9, 20)
x = 4
list_of_dates = [date for date in np.arange(first_date, last_date, timedelta(days=x)).astype(datetime)]
# create the dataframe
df = pd.DataFrame({'dates': list_of_dates})
dates
2017-01-01
2017-01-05
2017-01-09
2017-01-13
2017-01-17
df['Quarters'] = pd.PeriodIndex(df.dates, freq='Q-DEC')
Output:
print(df.head())
dates Quarters
2017-01-01 2017Q1
2017-01-05 2017Q1
2017-01-09 2017Q1
2017-01-13 2017Q1
2017-01-17 2017Q1
print(df.tail())
dates Quarters
2019-08-31 2019Q3
2019-09-04 2019Q3
2019-09-08 2019Q3
2019-09-12 2019Q3
2019-09-16 2019Q3
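On the original point of getting at '2017Q1' itself: a PeriodIndex can be indexed by position, and each element converts to a string with str(). The sketch below assumes a small made-up `dates` column, and the names `first` and `label` are illustrative.

```python
import pandas as pd

pi = pd.period_range(start="2017-01-01", end="2017-01-01", freq="Q")

# Positional indexing returns a single Period object
first = pi[0]

# str() gives the plain label
label = str(first)
print(label)

# The same label for a whole column of dates, stored as strings
df = pd.DataFrame({"dates": pd.to_datetime(["2017-01-01", "2017-05-15", "2017-11-30"])})
df["Quarters"] = df["dates"].dt.to_period("Q").astype(str)
print(df)
```

Keeping the column as periods (without the astype(str)) may be preferable if further date arithmetic on quarters is planned.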

Resample Daily Data to Monthly with Pandas (date formatting)

I am trying to resample some data from daily to monthly in a Pandas DataFrame. I am new to pandas and maybe I need to format the date and time first before I can do this, but I am not finding a good tutorial out there on the correct way to work with imported time series data. Everything I find is automatically importing data from Yahoo or Quandl.
Here is what I have in my DataFrame (shown as a screenshot of a DataFrame segment in the original post).
Here is the code I used to create my DataFrame:
import pandas as pd
# Import the Excel file into a pandas DataFrame (the keyword is sheet_name, not sheetname)
df = pd.read_excel('2016_forex_daily_returns.xlsx', sheet_name='Sheet 1')
#Calculate the daily returns
df['daily_ret'] = df['Equity'].pct_change()
# Assume an average annual risk-free rate over the period of 5%
df['excess_daily_ret'] = df['daily_ret'] - 0.05/252
Can someone help me understand what I need to do with the "Date" and "Time" columns in my DataFrame so I can resample?
To create the DataFrame you can use:
df = pd.read_excel('2016_forex_daily_returns.xlsx', sheet_name='Sheet 1')
print (df)
Date Time Equity
0 2016-01-03 22:16:22 300.38
1 2016-01-04 22:16:00 300.65
2 2016-01-05 14:26:02 301.65
3 2016-01-06 19:08:13 302.10
4 2016-01-07 18:39:00 302.55
5 2016-01-08 22:16:04 308.24
6 2016-01-11 02:49:39 306.69
7 2016-01-14 15:46:39 307.93
8 2016-01-19 15:56:31 308.18
I think you can first convert the Date column with to_datetime and then use resample with an aggregating function such as sum or mean:
df.Date = pd.to_datetime(df.Date)
df1 = df.resample('M', on='Date').sum()
print (df1)
Equity excess_daily_ret
Date
2016-01-31 2738.37 0.024252
df2 = df.resample('M', on='Date').mean()
print (df2)
Equity excess_daily_ret
Date
2016-01-31 304.263333 0.003032
df3 = df.set_index('Date').resample('M').mean()
print (df3)
Equity excess_daily_ret
Date
2016-01-31 304.263333 0.003032
To resample from daily data to monthly, you can use the resample method. Specifically for daily returns, the example below demonstrates a possible solution.
The following data is taken from an analysis performed by AQR. It represents the market daily returns for May, 2019. The following code may be used to construct the data as a pd.DataFrame.
import numpy as np
import pandas as pd
dates = pd.DatetimeIndex(['2019-05-01', '2019-05-02', '2019-05-03', '2019-05-06',
'2019-05-07', '2019-05-08', '2019-05-09', '2019-05-10',
'2019-05-13', '2019-05-14', '2019-05-15', '2019-05-16',
'2019-05-17', '2019-05-20', '2019-05-21', '2019-05-22',
'2019-05-23', '2019-05-24', '2019-05-27', '2019-05-28',
'2019-05-29', '2019-05-30', '2019-05-31'],
dtype='datetime64[ns]', name='DATE', freq=None)
# Daily market returns for May 2019, one value per date above
returns = np.array([-7.73787813e-03, -1.73277604e-03, 1.09124031e-02, -3.80437796e-03,
-1.66513456e-02, -1.67262934e-03, -2.77427734e-03, 4.01713274e-03,
-2.50407102e-02, 9.23270367e-03, 5.41897568e-03, 8.65419524e-03,
-6.83456209e-03, -6.54787106e-03, 9.04322511e-03, -4.05811322e-03,
-1.33152640e-02, 2.73398876e-03, -9.52000000e-05, -7.91438809e-03,
-7.16881982e-03, 1.19255102e-03, -1.24209547e-02])
daily_returns = pd.DataFrame(index=dates, data=returns, columns=["returns"])
Assuming you don't have daily price data, you can resample from daily returns to monthly returns using the following code.
>>> daily_returns.resample("M").apply(lambda x: ((x + 1).cumprod() - 1).last("D"))
-0.06532
If you refer to their monthly dataset, it reports the market return for May 2019 as approximately -6.52%, in line with the -0.06532 computed here.
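The compounding step can also be written as a product of (1 + r) per month. The sketch below uses three made-up returns so the result can be checked by hand. It groups by calendar month with to_period("M") instead of resample, an assumption made here because the month-end alias for resample differs between pandas versions.

```python
import pandas as pd

# Two days in January and one in February (illustrative values only)
idx = pd.to_datetime(["2019-01-30", "2019-01-31", "2019-02-01"])
returns = pd.Series([0.10, -0.10, 0.05], index=idx, name="returns")

# Compound within each calendar month: prod(1 + r) - 1
monthly = (1 + returns).groupby(returns.index.to_period("M")).prod() - 1
print(monthly)
```

January compounds to 1.10 * 0.90 - 1 = -0.01, which differs from the plain sum of the two daily returns (0.0). This is why daily returns should be compounded rather than summed.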
First, concatenate the 'Date' and 'Time' columns with space in between. Then convert that into a DateTime format using pd.to_datetime().
df = pd.read_excel('2016_forex_daily_returns.xlsx', sheet_name='Sheet 1')
print(df)
Date Time Equity
0 2016-01-03 22:16:22 300.38
1 2016-01-04 22:16:00 300.65
2 2016-01-05 14:26:02 301.65
3 2016-01-06 19:08:13 302.10
4 2016-01-07 18:39:00 302.55
5 2016-01-08 22:16:04 308.24
6 2016-01-11 02:49:39 306.69
7 2016-01-14 15:46:39 307.93
8 2016-01-19 15:56:31 308.18
df = df.drop(['Date', 'Time'], axis= 'columns').set_index(pd.to_datetime(df.Date + ' ' + df.Time))
df.index.name = 'Date/Time'
print(df)
Equity
Date/Time
2016-01-03 22:16:22 300.38
2016-01-04 22:16:00 300.65
2016-01-05 14:26:02 301.65
2016-01-06 19:08:13 302.10
2016-01-07 18:39:00 302.55
2016-01-08 22:16:04 308.24
2016-01-11 02:49:39 306.69
2016-01-14 15:46:39 307.93
2016-01-19 15:56:31 308.18
Now you can resample to any format you desire.
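A self-contained sketch of the same steps, without the Excel file: it assumes Date and Time arrive as strings (hence the astype(str), which is harmless if they already are). The 7-day mean at the end is just one illustrative choice of resampling, and the names `stamp` and `weekly_mean` are introduced here.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2016-01-03", "2016-01-04", "2016-01-05", "2016-01-06", "2016-01-07",
             "2016-01-08", "2016-01-11", "2016-01-14", "2016-01-19"],
    "Time": ["22:16:22", "22:16:00", "14:26:02", "19:08:13", "18:39:00",
             "22:16:04", "02:49:39", "15:46:39", "15:56:31"],
    "Equity": [300.38, 300.65, 301.65, 302.10, 302.55, 308.24, 306.69, 307.93, 308.18],
})

# Join the two text columns and parse them as one timestamp
stamp = pd.to_datetime(df["Date"].astype(str) + " " + df["Time"].astype(str))
df = df.drop(columns=["Date", "Time"]).set_index(stamp)
df.index.name = "Date/Time"

# With a datetime index in place, resampling works directly
weekly_mean = df["Equity"].resample("7D").mean()
print(weekly_mean)
```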
I have created a random DataFrame similar to yours here:
import numpy as np
import pandas as pd
# pd.datetime is not available in recent pandas versions; pd.Timestamp.today() is used instead
dates = pd.date_range(end=pd.Timestamp.today(), periods=1800)
counts = np.random.randint(0, 10000, size=1800)
df = pd.DataFrame({'dates': dates, 'counts': counts}).set_index('dates')
Here are the procedures to aggregate the sum of counts for each week as an example:
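# DatetimeIndex.week is not available in recent pandas versions; isocalendar() gives the ISO week number
df['week'] = df.index.isocalendar().week
df['year'] = df.index.year
target_df = df.groupby(['year', 'week']).agg({'counts': 'sum'})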
Where the output of target_df is:
counts
year week
2015 3 29877
4 36859
5 36872
6 36899
7 37769
. . .
. . .
. . .
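One caveat with grouping by calendar year and week number: the first days of January can belong to the last ISO week of the previous year. The sketch below uses made-up counts 0 to 13 to show grouping by the ISO year and ISO week together. The names `iso` and `weekly` are illustrative.

```python
import pandas as pd

# 14 days starting on Monday 2020-12-28, spanning the year boundary
dates = pd.date_range("2020-12-28", periods=14, freq="D")
df = pd.DataFrame({"counts": range(14)}, index=dates)

# isocalendar() returns the ISO year, week and day for each timestamp
iso = df.index.isocalendar()

# Group by ISO year and ISO week so a week is never split across two years
weekly = df.groupby([iso.year, iso.week])["counts"].sum()
print(weekly)
```

Here 1 to 3 January 2021 fall into ISO week 53 of 2020, so the two groups are complete Monday-to-Sunday weeks.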
