Pandas: Average values in groups of X days - python

I want to average the value of AAPL.High in groups of 10 days (Jan/01 to Jan/10), using day number 10 as the reference date.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/finance-charts-apple.csv')
The idea of the code is approximately:
df1['demand'] = df1.groupby(['supplier_name', 'date'])['difference'].transform('mean').fillna(0)

This is a simple case of defining the index as the dates and then just using resample():
import pandas as pd
import numpy as np

df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/finance-charts-apple.csv')
df.index = pd.to_datetime(df["Date"])
df.resample("10d").agg({"AAPL.High": np.mean})
Output:
AAPL.High
Date
2015-02-17 130.657501
2015-02-27 129.675001
2015-03-09 126.661251
2015-03-19 127.134283
2015-03-29 126.533333
... ...
2017-01-07 119.532001
2017-01-17 120.841248
2017-01-27 125.740000
2017-02-06 133.172500
2017-02-16 135.899994
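If the 10-day bins need to be anchored to a specific reference day (such as the January 1 / January 10 boundary mentioned in the question), the origin argument of resample can shift the bin edges. A minimal sketch, assuming pandas 1.1 or later and an arbitrarily chosen anchor date:

import numpy as np
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/finance-charts-apple.csv')
df.index = pd.to_datetime(df["Date"])

# anchor the 10-day bins to a chosen reference date (hypothetical choice)
df.resample("10d", origin=pd.Timestamp("2015-01-01")).agg({"AAPL.High": np.mean})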

Related

Is there a way to create a dataframe in python from start date to end date

Is there a way to create a dataframe (using pandas) from a start date to an end date with random values?
For example, the required data frame:
date value
2015-06-25 12
2015-06-26 23
2015-06-27 3.4
2015-06-28 5.6
In this dataframe I need to set the "from date" and the "to date" so that the dates don't have to be typed manually (the number of rows keeps increasing), and the "value" column should be filled with randomly generated values.
You can use pandas.date_range and numpy.random.uniform:
import pandas as pd
import numpy as np

dates = pd.date_range('2015-06-25', '2015-06-28', freq='D')
df = pd.DataFrame({'date': dates,
                   'value': np.random.uniform(0, 50, size=len(dates))})
Output:
date value
0 2015-06-25 39.496971
1 2015-06-26 29.947877
2 2015-06-27 7.328549
3 2015-06-28 6.738427
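If the random values need to be reproducible across runs, the NumPy generator can be seeded first (optional; a minimal sketch with a hypothetical seed value):

np.random.seed(42)  # hypothetical seed; any fixed integer makes the generated values repeatable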
import pandas as pd
import numpy as np

beg = '2015-06-25'
end = '2015-06-28'

index = pd.date_range(beg, end)
values = np.random.rand(len(index))
df = pd.DataFrame(values, index=index)
Output:
index                        0
2015-06-25   0.865455754057059
2015-06-26   0.959574113247196
2015-06-27  0.8634928651970131
2015-06-28  0.6448579391510935

Pd.Grouper to shift the bin intervals

In pandas 1.2 I would like to adjust origin and offset so that the bins in the example below fall on exact yyyy-05-15 and yyyy-11-15 six-month (6M) boundaries. Here is the help link to pd.Grouper.
My attempt with the following code:
import random
import pandas as pd

n_rows = 600
df = pd.DataFrame({'date': pd.date_range(periods=n_rows, end='2020-04-15'),
                   'current_start_date': '2020-11-15',
                   'current_end_date': '2021-05-15',
                   'a': range(n_rows)})
df[['current_start_date', 'current_end_date']] = df[['current_start_date',
                                                     'current_end_date']].apply(pd.to_datetime)

end = df['current_end_date'].iloc[0]
df.groupby(pd.Grouper(freq='6M', key='date', origin=end, label='right', closed='right')).sum()
produces
Out[3]:
a
date
2018-08-31 21
2019-02-28 17557
2019-08-31 51428
2020-02-29 84175
2020-08-31 26519
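Since the output above shows the 6M bins still anchored to month ends (origin has no visible effect for this month-based frequency), one possible workaround, not from the original thread, is to build the yyyy-05-15 / yyyy-11-15 edges explicitly and group with pd.cut. A sketch on a minimal version of the example frame, with hypothetical edge dates chosen to cover the data:

import pandas as pd

n_rows = 600
df = pd.DataFrame({'date': pd.date_range(periods=n_rows, end='2020-04-15'),
                   'a': range(n_rows)})

# explicit semi-annual edges on the 15th of May and November (adjust the span as needed)
edges = pd.date_range('2018-05-15', '2020-11-15', freq=pd.DateOffset(months=6))

# label each bin by its right edge, then aggregate
bins = pd.cut(df['date'], bins=edges, right=True, labels=edges[1:])
df.groupby(bins, observed=True)['a'].sum()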

pandas period_range - gaining access to the

pd.period_range(start='2017-01-01', end='2017-01-01', freq='Q') gives me the following:
PeriodIndex(['2017Q1'], dtype='period[Q-DEC]', freq='Q-DEC')
I would like to gain access to '2017Q1' to put it into a different column in the dataframe.
I have a date column with dates, e.g., 1/1/2017. I have another column where I'd like to put the string calculated by the period range; it seems like that would be an efficient way to fill the column with the fiscal quarter. I can't seem to access it, though: it isn't subscriptable, and I can't get at it even when I assign it to a variable. Just wondering what I'm missing.
Use pandas.PeriodIndex. You were almost there; it just needs to be assigned to a column:
import numpy as np
from datetime import timedelta, datetime
import pandas as pd
# list of dates - test data
first_date = datetime(2017, 1, 1)
last_date = datetime(2019, 9, 20)
x = 4
list_of_dates = [date for date in np.arange(first_date, last_date, timedelta(days=x)).astype(datetime)]
# create the dataframe
df = pd.DataFrame({'dates': list_of_dates})
        dates
0  2017-01-01
1  2017-01-05
2  2017-01-09
3  2017-01-13
4  2017-01-17
df['Quarters'] = pd.PeriodIndex(df.dates, freq='Q-DEC')
Output:
print(df.head())
dates Quarters
2017-01-01 2017Q1
2017-01-05 2017Q1
2017-01-09 2017Q1
2017-01-13 2017Q1
2017-01-17 2017Q1
print(df.tail())
dates Quarters
2019-08-31 2019Q3
2019-09-04 2019Q3
2019-09-08 2019Q3
2019-09-12 2019Q3
2019-09-16 2019Q3
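Equivalently, if the dates column already holds datetime64 values (as constructed above), Series.dt.to_period produces the same quarter periods without building a PeriodIndex by hand (a minimal alternative sketch):

# assumes df['dates'] is datetime64, as built in the example above
df['Quarters'] = df['dates'].dt.to_period('Q-DEC')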

Resample Daily Data to Monthly with Pandas (date formatting)

I am trying to resample some data from daily to monthly in a Pandas DataFrame. I am new to pandas and maybe I need to format the date and time first before I can do this, but I am not finding a good tutorial out there on the correct way to work with imported time series data. Everything I find is automatically importing data from Yahoo or Quandl.
Here is what I have in my DataFrame:
[dataframe segment screenshot]
Here is the code I used to create my DataFrame:
import pandas as pd

# Import the Excel file into a pandas DataFrame
df = pd.read_excel('2016_forex_daily_returns.xlsx', sheet_name='Sheet 1')
#Calculate the daily returns
df['daily_ret'] = df['Equity'].pct_change()
# Assume an average annual risk-free rate over the period of 5%
df['excess_daily_ret'] = df['daily_ret'] - 0.05/252
Can someone help me understand what I need to do with the "Date" and "Time" columns in my DataFrame so I can resample?
To create the DataFrame you can use:
df = pd.read_excel('2016_forex_daily_returns.xlsx', sheet_name='Sheet 1')
print (df)
Date Time Equity
0 2016-01-03 22:16:22 300.38
1 2016-01-04 22:16:00 300.65
2 2016-01-05 14:26:02 301.65
3 2016-01-06 19:08:13 302.10
4 2016-01-07 18:39:00 302.55
5 2016-01-08 22:16:04 308.24
6 2016-01-11 02:49:39 306.69
7 2016-01-14 15:46:39 307.93
8 2016-01-19 15:56:31 308.18
I think you can first convert the Date column with to_datetime and then use resample with an aggregating function such as sum or mean:
df.Date = pd.to_datetime(df.Date)
df1 = df.resample('M', on='Date').sum()
print (df1)
Equity excess_daily_ret
Date
2016-01-31 2738.37 0.024252
df2 = df.resample('M', on='Date').mean()
print (df2)
Equity excess_daily_ret
Date
2016-01-31 304.263333 0.003032
df3 = df.set_index('Date').resample('M').mean()
print (df3)
Equity excess_daily_ret
Date
2016-01-31 304.263333 0.003032
To resample from daily data to monthly, you can use the resample method. Specifically for daily returns, the example below demonstrates a possible solution.
The following data is taken from an analysis performed by AQR. It represents the market's daily returns for May 2019. The following code may be used to construct the data as a pd.DataFrame.
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(['2019-05-01', '2019-05-02', '2019-05-03', '2019-05-06',
                          '2019-05-07', '2019-05-08', '2019-05-09', '2019-05-10',
                          '2019-05-13', '2019-05-14', '2019-05-15', '2019-05-16',
                          '2019-05-17', '2019-05-20', '2019-05-21', '2019-05-22',
                          '2019-05-23', '2019-05-24', '2019-05-27', '2019-05-28',
                          '2019-05-29', '2019-05-30', '2019-05-31'],
                         name='DATE')

may = np.array([-7.73787813e-03, -1.73277604e-03,  1.09124031e-02, -3.80437796e-03,
                -1.66513456e-02, -1.67262934e-03, -2.77427734e-03,  4.01713274e-03,
                -2.50407102e-02,  9.23270367e-03,  5.41897568e-03,  8.65419524e-03,
                -6.83456209e-03, -6.54787106e-03,  9.04322511e-03, -4.05811322e-03,
                -1.33152640e-02,  2.73398876e-03, -9.52000000e-05, -7.91438809e-03,
                -7.16881982e-03,  1.19255102e-03, -1.24209547e-02])

daily_returns = pd.DataFrame(index=dates, data=may, columns=["returns"])
Assuming you don't have daily price data, you can resample from daily returns to monthly returns using the following code.
>>> daily_returns.resample("M").apply(lambda x: ((x + 1).cumprod() - 1).last("D"))
-0.06532
If you refer to their monthly dataset, this confirms that the market return for May 2019 was approximated to be -6.52% or -0.06532.
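The same monthly compounding can also be written with prod, which avoids the intermediate cumulative product (a sketch of an equivalent formulation, not from the original answer):

# compound each month's daily returns: (1 + r1)(1 + r2)...(1 + rn) - 1
monthly_returns = daily_returns.resample("M").apply(lambda x: (1 + x).prod() - 1)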
First, concatenate the 'Date' and 'Time' columns with space in between. Then convert that into a DateTime format using pd.to_datetime().
df = pd.read_excel('2016_forex_daily_returns.xlsx', sheet_name='Sheet 1')
print(df)
Date Time Equity
0 2016-01-03 22:16:22 300.38
1 2016-01-04 22:16:00 300.65
2 2016-01-05 14:26:02 301.65
3 2016-01-06 19:08:13 302.10
4 2016-01-07 18:39:00 302.55
5 2016-01-08 22:16:04 308.24
6 2016-01-11 02:49:39 306.69
7 2016-01-14 15:46:39 307.93
8 2016-01-19 15:56:31 308.18
df = df.drop(['Date', 'Time'], axis='columns').set_index(pd.to_datetime(df.Date + ' ' + df.Time))
df.index.name = 'Date/Time'
print(df)
Equity
Date/Time
2016-01-03 22:16:22 300.38
2016-01-04 22:16:00 300.65
2016-01-05 14:26:02 301.65
2016-01-06 19:08:13 302.10
2016-01-07 18:39:00 302.55
2016-01-08 22:16:04 308.24
2016-01-11 02:49:39 306.69
2016-01-14 15:46:39 307.93
2016-01-19 15:56:31 308.18
Now you can resample to any format you desire.
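For example, a monthly mean of the equity column can now be computed directly against the datetime index (a minimal sketch):

monthly_equity = df.resample('M')['Equity'].mean()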
I have created a random DataFrame similar to yours here:
import numpy as np
import pandas as pd

# pd.datetime has been removed from recent pandas; use pd.Timestamp instead
dates = pd.date_range(end=pd.Timestamp.today(), periods=1800)
counts = np.random.randint(0, 10000, size=1800)
df = pd.DataFrame({'dates': dates, 'counts': counts}).set_index('dates')
Here are the procedures to aggregate the sum of counts for each week as an example:
df['week'] = df.index.isocalendar().week  # DatetimeIndex.week is deprecated in newer pandas
df['year'] = df.index.year
target_df = df.groupby(['year', 'week']).agg({'counts': np.sum})
Where the output of target_df is:
counts
year week
2015 3 29877
4 36859
5 36872
6 36899
7 37769
 ...  ...      ...
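An alternative that avoids the helper columns is to group directly on the datetime index with pd.Grouper (a sketch; note the result is then labelled by week-ending date rather than by (year, week)):

# weekly sums keyed by the week-ending timestamp instead of (year, week)
weekly = df.groupby(pd.Grouper(freq='W'))['counts'].sum()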

How to create 'dates' column for using statsmodels.tsa.arima_model

The pandas DataFrame df contains the following info:
df
Date Param1
0 2011-12-21 55.12
1 2011-12-28 70.12
2 2012-01-04 33.73
3 2012-01-11 53.14
4 2012-01-18 74.16
To be able to use ARIMA I convert dates as follows:
dates = dates_from_range('2011m1', length=len(df.Param1))
df.index = dates
df = df[['Param1']]
from statsmodels.tsa.arima_model import ARIMA
res = ARIMA(df, (1,0,0), freq='M').fit()
However, in this case, instead of getting the correct sequence of dates, I get this:
Date Param1
2011-01-31 55.12
2011-02-29 70.12
2011-03-31 33.73
2011-04-30 53.14
2011-05-31 74.16
How to avoid using dates_from_range and construct dates manually?
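One possible approach (a sketch, not from the original thread) is to skip dates_from_range altogether and build the index from the existing Date column, which already holds weekly (Wednesday) dates:

import pandas as pd
from statsmodels.tsa.arima_model import ARIMA  # as imported in the question

# use the real dates from the Date column instead of a generated monthly range
df.index = pd.to_datetime(df['Date'])
series = df['Param1'].asfreq('W-WED')  # assumes no missing weeks in the data

res = ARIMA(series, (1, 0, 0)).fit()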
