I am trying to resample some data from daily to monthly in a Pandas DataFrame. I am new to pandas and maybe I need to format the date and time first before I can do this, but I am not finding a good tutorial out there on the correct way to work with imported time series data. Everything I find is automatically importing data from Yahoo or Quandl.
Here is what I have in my DataFrame:
[screenshot of the DataFrame segment]
Here is the code I used to create my DataFrame:
# Import the Excel file into a pandas DataFrame
df = pd.read_excel('2016_forex_daily_returns.xlsx', sheet_name='Sheet 1')
# Calculate the daily returns
df['daily_ret'] = df['Equity'].pct_change()
# Assume an average annual risk-free rate over the period of 5%
df['excess_daily_ret'] = df['daily_ret'] - 0.05/252
Can someone help me understand what I need to do with the "Date" and "Time" columns in my DataFrame so I can resample?
To create the DataFrame you can use:
df = pd.read_excel('2016_forex_daily_returns.xlsx', sheet_name='Sheet 1')
print (df)
Date Time Equity
0 2016-01-03 22:16:22 300.38
1 2016-01-04 22:16:00 300.65
2 2016-01-05 14:26:02 301.65
3 2016-01-06 19:08:13 302.10
4 2016-01-07 18:39:00 302.55
5 2016-01-08 22:16:04 308.24
6 2016-01-11 02:49:39 306.69
7 2016-01-14 15:46:39 307.93
8 2016-01-19 15:56:31 308.18
I think you can first convert column Date with to_datetime and then use resample with an aggregating function like sum or mean:
df.Date = pd.to_datetime(df.Date)
df1 = df.resample('M', on='Date').sum()
print (df1)
Equity excess_daily_ret
Date
2016-01-31 2738.37 0.024252
df2 = df.resample('M', on='Date').mean()
print (df2)
Equity excess_daily_ret
Date
2016-01-31 304.263333 0.003032
df3 = df.set_index('Date').resample('M').mean()
print (df3)
Equity excess_daily_ret
Date
2016-01-31 304.263333 0.003032
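Side note (my addition, not part of the original answer): for an equity curve, the last value of each month is often a more meaningful aggregate than sum or mean:
df4 = df.set_index('Date').resample('M')['Equity'].last()
print (df4)
Date
2016-01-31    308.18
Freq: M, Name: Equity, dtype: float64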
To resample from daily data to monthly, you can use the resample method. Specifically for daily returns, the example below demonstrates a possible solution.
The following data is taken from an analysis performed by AQR. It represents the daily market returns for May 2019, and the following code may be used to construct the data as a pd.DataFrame.
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(['2019-05-01', '2019-05-02', '2019-05-03', '2019-05-06',
                          '2019-05-07', '2019-05-08', '2019-05-09', '2019-05-10',
                          '2019-05-13', '2019-05-14', '2019-05-15', '2019-05-16',
                          '2019-05-17', '2019-05-20', '2019-05-21', '2019-05-22',
                          '2019-05-23', '2019-05-24', '2019-05-27', '2019-05-28',
                          '2019-05-29', '2019-05-30', '2019-05-31'],
                         dtype='datetime64[ns]', name='DATE', freq=None)
daily_returns = np.array([-7.73787813e-03, -1.73277604e-03, 1.09124031e-02, -3.80437796e-03,
                          -1.66513456e-02, -1.67262934e-03, -2.77427734e-03, 4.01713274e-03,
                          -2.50407102e-02, 9.23270367e-03, 5.41897568e-03, 8.65419524e-03,
                          -6.83456209e-03, -6.54787106e-03, 9.04322511e-03, -4.05811322e-03,
                          -1.33152640e-02, 2.73398876e-03, -9.52000000e-05, -7.91438809e-03,
                          -7.16881982e-03, 1.19255102e-03, -1.24209547e-02])
daily_returns = pd.DataFrame(index=dates, data=daily_returns, columns=["returns"])
Assuming you don't have daily price data, you can resample from daily returns to monthly returns using the following code.
>>> daily_returns.resample("M").apply(lambda x: ((x + 1).cumprod() - 1).last("D"))
-0.06532
If you refer to their monthly dataset, this confirms that the market return for May 2019 was approximately -6.52%, matching the -0.06532 computed above.
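As a quick sanity check (my addition, not part of the original answer), compounding all of May's daily returns directly yields the same figure, since the last value of the cumulative product equals the full product:
>>> print(round((1 + daily_returns["returns"]).prod() - 1, 5))
-0.06532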
First, concatenate the 'Date' and 'Time' columns with a space in between. Then convert the result to datetime format using pd.to_datetime().
df = pd.read_excel('2016_forex_daily_returns.xlsx', sheet_name='Sheet 1')
print(df)
Date Time Equity
0 2016-01-03 22:16:22 300.38
1 2016-01-04 22:16:00 300.65
2 2016-01-05 14:26:02 301.65
3 2016-01-06 19:08:13 302.10
4 2016-01-07 18:39:00 302.55
5 2016-01-08 22:16:04 308.24
6 2016-01-11 02:49:39 306.69
7 2016-01-14 15:46:39 307.93
8 2016-01-19 15:56:31 308.18
# astype(str) guards against read_excel having parsed Date/Time as datetime/time objects
df = df.drop(['Date', 'Time'], axis='columns').set_index(pd.to_datetime(df.Date.astype(str) + ' ' + df.Time.astype(str)))
df.index.name = 'Date/Time'
print(df)
Equity
Date/Time
2016-01-03 22:16:22 300.38
2016-01-04 22:16:00 300.65
2016-01-05 14:26:02 301.65
2016-01-06 19:08:13 302.10
2016-01-07 18:39:00 302.55
2016-01-08 22:16:04 308.24
2016-01-11 02:49:39 306.69
2016-01-14 15:46:39 307.93
2016-01-19 15:56:31 308.18
Now you can resample to any format you desire.
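For example (a sketch of my own, continuing from the frame above), month-end equity and month-over-month returns:
monthly = df.resample('M').last()  # last Equity reading in each month
monthly['monthly_ret'] = monthly['Equity'].pct_change()
print(monthly)
            Equity  monthly_ret
Date/Time
2016-01-31  308.18          NaN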
I have created a random DataFrame similar to yours here:
import numpy as np
import pandas as pd
dates = pd.date_range(end=pd.Timestamp.today(), periods=1800)  # pd.datetime was removed; use pd.Timestamp
counts = np.random.randint(0, 10000, size=1800)
df = pd.DataFrame({'dates': dates, 'counts': counts}).set_index('dates')
Here are the procedures to aggregate the sum of counts for each week as an example:
df['week'] = df.index.isocalendar().week  # DatetimeIndex.week was removed in pandas 2.0
df['year'] = df.index.year
target_df = df.groupby(['year', 'week']).agg({'counts': 'sum'})
Where the output of target_df is:
counts
year week
2015 3 29877
4 36859
5 36872
6 36899
7 37769
...          ...
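As an aside (my addition, not part of the original answer), resample can produce the same weekly sums in one line, without the helper columns. Note that 'W' (an alias for 'W-SUN') bins Monday through Sunday, matching ISO weeks, but labels each bin by its Sunday end date rather than by (year, week):
weekly = df.resample('W')['counts'].sum()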
I'm working with the following dataset:
Date
2016-01-04
2016-01-05
2016-01-06
2016-01-07
2016-01-08
and a list holidays = ['2016-01-01','2016-01-18'....'2017-11-23','2017-12-25']
Objective: Create a column indicating whether a particular date is within +- 7 days of any holiday present in the list.
Mock output:
Date        Within a week of Holiday
2016-01-04  1
2016-01-05  1
2016-01-06  1
2016-01-07  1
2016-01-08  0
I'm working with a lot of date records and am thus trying to find the quickest (most optimized) way to do this.
My Current Solution:
One way I figured to do this quickly would be to create another list with only the unique dates for my desired duration (say 2 years). This way, I can implement a simple solution with 2 for loops to check if a date is within +-7 days of a holiday, and it wouldn't be computationally heavy, as both lists would be relatively small (730 unique dates and ~20 dates in the holiday list).
Once I have my desired list of dates, all I have to do is run a single check on my 'Date' column to see if each date is part of this new list. However, any suggestions to do this even quicker?
Turn holidays into a DataFrame and then merge_asof with a tolerance of 6 days (note that merge_asof matches backward by default, i.e. the nearest holiday on or before each date; pass direction='nearest' if dates can also fall shortly before a holiday):
new_df = pd.merge_asof(df, holidays, left_on='Date', right_on='Holiday',
                       tolerance=pd.Timedelta(days=6))
new_df['Holiday'] = np.where(new_df['Holiday'].notnull(), 1, 0)
new_df = new_df.rename(columns={'Holiday': 'Within a week of Holiday'})
Complete Working Example:
import numpy as np
import pandas as pd
holidays = pd.DataFrame(pd.to_datetime(['2016-01-01', '2016-01-18']),
                        columns=['Holiday'])
df = pd.DataFrame({
    'Date': ['2016-01-04', '2016-01-05', '2016-01-06', '2016-01-07',
             '2016-01-08']
})
df['Date'] = pd.to_datetime(df['Date'])
new_df = pd.merge_asof(df, holidays, left_on='Date', right_on='Holiday',
                       tolerance=pd.Timedelta(days=6))
new_df['Holiday'] = np.where(new_df['Holiday'].notnull(), 1, 0)
new_df = new_df.rename(columns={'Holiday': 'Within a week of Holiday'})
print(new_df)
new_df:
Date Within a week of Holiday
0 2016-01-04 1
1 2016-01-05 1
2 2016-01-06 1
3 2016-01-07 1
4 2016-01-08 0
Or turn holidays into a NumPy datetime64 array, broadcast subtraction across the 'Date' column, compare the absolute difference to 7 days, and see if there are any matches:
holidays = np.array(['2016-01-01', '2016-01-18']).astype('datetime64')
df['Within a week of Holiday'] = (
    abs(df['Date'].values - holidays[:, None]) < pd.Timedelta(days=7)
).any(axis=0).astype(int)
Complete Working Example:
import numpy as np
import pandas as pd
holidays = np.array(['2016-01-01', '2016-01-18']).astype('datetime64')
df = pd.DataFrame({
    'Date': ['2016-01-04', '2016-01-05', '2016-01-06', '2016-01-07',
             '2016-01-08']
})
df['Date'] = pd.to_datetime(df['Date'])
df['Within a week of Holiday'] = (
    abs(df['Date'].values - holidays[:, None]) < pd.Timedelta(days=7)
).any(axis=0).astype(int)
print(df)
df:
Date Within a week of Holiday
0 2016-01-04 1
1 2016-01-05 1
2 2016-01-06 1
3 2016-01-07 1
4 2016-01-08 0
Make a function that checks every date within +-7 days of a given date and returns True if any of them is in holidays, otherwise False, then apply that function to the DataFrame:
import datetime
import pandas as pd
holidays = ['2016-01-01','2016-01-18','2017-11-23','2017-12-25']
def holiday_present(date):
    date = datetime.datetime.strptime(date, '%Y-%m-%d')
    for i in range(-6, 7):  # offsets of -6..6 days, consistent with the other answers here
        shifted = (date - datetime.timedelta(days=i)).strftime('%Y-%m-%d')
        if shifted in holidays:
            return True
    return False
data = {
    "Date": [
        "2016-01-04",
        "2016-01-05",
        "2016-01-06",
        "2016-01-07",
        "2016-01-08"]
}
df = pd.DataFrame(data)
df["Within a week of Holiday"] = df["Date"].apply(holiday_present).astype(int)
Output:
Date Within a week of Holiday
0 2016-01-04 1
1 2016-01-05 1
2 2016-01-06 1
3 2016-01-07 1
4 2016-01-08 0
Try this:
Sample:
import pandas as pd
df = pd.DataFrame({'Date': {0: '2016-01-04',
                            1: '2016-01-05',
                            2: '2016-01-06',
                            3: '2016-01-07',
                            4: '2016-01-08'}})
Code:
def get_date_range(holidays):
    h = [pd.to_datetime(x) for x in holidays]
    h = [pd.date_range(x - pd.DateOffset(6), x + pd.DateOffset(6)) for x in h]
    h = [x.strftime('%Y-%m-%d') for y in h for x in y]
    return h
df['Within a week of Holiday'] = df['Date'].isin(get_date_range(holidays))*1
Result:
Out[141]:
0 1
1 1
2 1
3 1
4 0
Name: Within a week of Holiday, dtype: int32
What would be the correct way to show the average sales volume in Carlisle city for each year between 2010 and 2020?
Here is an abbreviated form of the large data frame showing only the columns and rows relevant to the question:
import pandas as pd
df = pd.DataFrame({'Date': ['01/09/2009','01/10/2009','01/11/2009','01/12/2009','01/01/2010','01/02/2010','01/03/2010','01/04/2010','01/05/2010','01/06/2010','01/07/2010','01/08/2010','01/09/2010','01/10/2010','01/11/2010','01/12/2010','01/01/2011','01/02/2011'],
                   'RegionName': ['Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle'],
                   'SalesVolume': [118,137,122,132,83,81,105,114,110,106,137,130,129,121,129,100,84,62]})
This is what I've tried:
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv ('C:/Users/user/AppData/Local/Programs/Python/Python39/Scripts/uk_hpi_dataset_2021_01.csv')
df.Date = pd.to_datetime(df.Date)
df['Year'] = pd.to_datetime(df['Date']).apply(lambda x:
    '{year}'.format(year=x.year).zfill(2))
carlisle_vol = df[df['RegionName'].str.contains('Carlisle')]
carlisle_vol.groupby('Year')['SalesVolume'].mean()
print(sales_vol)
When I try to run this code, it doesn't filter the 'Date' column to only calculate the average SalesVolume for the years beginning '01/01/2010' and ending '01/12/2020'. For some reason, it also prints out every other column as well. Can anyone please help me answer this question correctly?
This is the result I've got (screenshot omitted).
>>> df.loc[(df["Date"].dt.year.between(2010, 2020))
& (df["RegionName"] == "Carlisle")] \
.groupby([pd.Grouper(key="Date", freq="Y")])["SalesVolume"].mean()
Date
2010-01-01 112.083333
2011-01-01 73.000000
Freq: A-DEC, Name: SalesVolume, dtype: float64
Going further: the only difference from #nocibambi's answer is the groupby parameter, specifically the freq argument of pd.Grouper. Imagine your accounting year starts on the 1st of September.
Sales every 3 months:
>>> df
Date Sales
0 2010-09-01 1 # 1st group: mean=2.5
1 2010-12-01 2
2 2011-03-01 3
3 2011-06-01 4
4 2011-09-01 5 # 2nd group: mean=6.5
5 2011-12-01 6
6 2012-03-01 7
7 2012-06-01 8
>>> df.groupby(pd.Grouper(key="Date", freq="AS-SEP")).mean()
Sales
Date
2010-09-01 2.5
2011-09-01 6.5
Check the documentation for the full list of freq aliases and anchoring suffixes.
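As a quick illustration of the anchoring suffix (my own example, reusing the data above), anchor the year to April instead of September:
>>> df.groupby(pd.Grouper(key="Date", freq="AS-APR")).mean()
            Sales
Date
2010-04-01    2.0
2011-04-01    5.5
2012-04-01    8.0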
You can access year with the datetime accessor:
df[
    (df["RegionName"] == "Carlisle")
    & (df["Date"].dt.year >= 2010)
    & (df["Date"].dt.year <= 2020)
].groupby(df.Date.dt.year)["SalesVolume"].mean()
>>>
Date
2010 112.083333
2011 73.000000
Name: SalesVolume, dtype: float64
I want to average the value of AAPL.High in groups of 10 days (e.g. Jan/01 to Jan/10), using day number 10 as the reference.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/finance-charts-apple.csv')
The idea of the code is approximately:
df1['demand'] = df1.groupby(['supplier_name', 'date'])['difference'].transform('mean').fillna(0)
This is a simple case: define the index as the dates, then just use resample():
import numpy as np
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/finance-charts-apple.csv')
df.index = pd.to_datetime(df["Date"])
df.resample("10d").agg({"AAPL.High": np.mean})
Output:
AAPL.High
Date
2015-02-17 130.657501
2015-02-27 129.675001
2015-03-09 126.661251
2015-03-19 127.134283
2015-03-29 126.533333
... ...
2017-01-07 119.532001
2017-01-17 120.841248
2017-01-27 125.740000
2017-02-06 133.172500
2017-02-16 135.899994
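The question mentions using day number 10 as the reference; on pandas >= 1.1 you can anchor the bins explicitly with resample's origin argument (my assumption about the intended anchoring, with an arbitrary reference date):
# anchor the 10-day bins to a chosen reference date (pandas >= 1.1)
df.resample("10d", origin=pd.Timestamp("2015-02-10")).agg({"AAPL.High": np.mean})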
How do I sort a pandas DataFrame according to dates in the format shown in the image? The output I want is the same DataFrame, but with January 2013 and the corresponding amount at index 0, February 2013 at index 1, and so on.
import pandas as pd
df = pd.DataFrame({'Amount': ['54241.25', '54008.83', '54008.82'],
                   'Date': ['05/01/2015', '05/01/2017', '06/01/2017']})
df['Date'] = pd.to_datetime(df.Date)
df.sort_values('Date', inplace=True)
You just need to convert your Date column to a datetime; then you can sort the dataframe by that column:
import pandas as pd
df = pd.DataFrame({'Date': ['05-2016', '05-2017', '06-2017', '01-2017', '02-2017'],
                   'Amount': [2, 5, 6, 3, 2]})
df['Date'] = pd.to_datetime(df['Date'], format='%m-%Y')
df = df.sort_values('Date').reset_index(drop=True)
Which gives:
Date Amount
0 2016-05-01 2
1 2017-01-01 3
2 2017-02-01 2
3 2017-05-01 5
4 2017-06-01 6
I have stock data downloaded from Yahoo Finance. I want to pick up the rows corresponding to month start and month end. I am trying to do this with a python pandas DataFrame, but I cannot find the correct method to get the start and end of each month. I would be grateful if somebody could help me solve this.
Please note that if the 1st of the month is a holiday and there is no data for it, I need to pick up the 2nd day's data. The same rule applies to the last day of the month. Thanks in advance.
Example data:
2016-01-05,222.80,222.80,217.00,217.75,15074800,217.75
2016-01-04,226.95,226.95,220.05,220.70,14092000,220.70
2015-12-31,225.95,226.55,224.00,224.45,11558300,224.45
2015-12-30,229.00,229.70,224.85,225.80,11702800,225.80
2015-12-29,228.85,229.95,227.50,228.20,7263200,228.20
2015-12-28,229.05,229.95,228.00,228.90,8756800,228.90
........
........
2015-12-04,240.00,242.15,238.05,241.10,11115100,241.10
2015-12-03,244.15,244.50,240.40,241.10,7155600,241.10
2015-12-02,250.55,250.65,243.75,244.60,10881700,244.60
2015-11-30,249.65,253.00,245.00,250.20,12865400,250.20
2015-11-27,243.00,250.50,242.80,249.70,15149900,249.70
2015-11-26,241.95,244.90,241.00,242.50,13629800,242.50
First, convert your date column to datetime format, then group by month, sort each group by date, and take the first/last row using the head/tail methods, like so:
In [37]: df
Out[37]:
0 1 2 3 4 5 6
0 2016-01-05 222.80 222.80 217.00 217.75 15074800 217.75
1 2016-01-04 226.95 226.95 220.05 220.70 14092000 220.70
2 2015-12-31 225.95 226.55 224.00 224.45 11558300 224.45
3 2015-12-30 229.00 229.70 224.85 225.80 11702800 225.80
4 2015-12-29 228.85 229.95 227.50 228.20 7263200 228.20
5 2015-12-28 229.05 229.95 228.00 228.90 8756800 228.90
In [25]: import datetime
In [29]: df[0] = df[0].apply(lambda x: datetime.datetime.strptime(x, '%Y-%m-%d'))

In [36]: df.groupby(df[0].apply(lambda x: x.month)).apply(lambda x: x.sort_values(0).head(1))
Out[36]:
0 1 2 3 4 5 6
0
1 1 2016-01-04 226.95 226.95 220.05 220.7 14092000 220.7
12 5 2015-12-28 229.05 229.95 228.00 228.9 8756800 228.9
In [38]: df.groupby(df[0].apply(lambda x: x.month)).apply(lambda x: x.sort_values(0).tail(1))
Out[38]:
0 1 2 3 4 5 6
0
1 0 2016-01-05 222.80 222.80 217.0 217.75 15074800 217.75
12 2 2015-12-31 225.95 226.55 224.0 224.45 11558300 224.45
You can merge the resulting DataFrames using pd.concat().
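For example (a sketch of my own, reusing the In [36] and In [38] expressions above):
firsts = df.groupby(df[0].apply(lambda x: x.month)).apply(lambda x: x.sort_values(0).head(1))
lasts = df.groupby(df[0].apply(lambda x: x.month)).apply(lambda x: x.sort_values(0).tail(1))
month_edges = pd.concat([firsts, lasts]).sort_values(0)  # first and last row of each month, in date order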
For the first / last day of each month, you can use .resample() with 'BMS' and 'BM' for Business Month (Start) like so (using pandas 0.18 syntax):
df.resample('BMS').first()
df.resample('BM').last()
This assumes that your data have a DatetimeIndex, as is usual when downloaded from Yahoo using pandas_datareader:
from datetime import datetime
from pandas_datareader.data import DataReader
df = DataReader('FB', 'yahoo', datetime(2015, 1, 1), datetime(2015, 3, 31))['Open']
df.head()
Date
2015-01-02 78.580002
2015-01-05 77.980003
2015-01-06 77.230003
2015-01-07 76.760002
2015-01-08 76.739998
Name: Open, dtype: float64
df.tail()
Date
2015-03-25 85.500000
2015-03-26 82.720001
2015-03-27 83.379997
2015-03-30 83.809998
2015-03-31 82.900002
Name: Open, dtype: float64
do:
df.resample('BMS').first()
Date
2015-01-01 78.580002
2015-02-02 76.110001
2015-03-02 79.000000
Freq: BMS, Name: Open, dtype: float64
and
df.resample('BM').last()
to get:
Date
2015-01-30 78.000000
2015-02-27 80.680000
2015-03-31 82.900002
Freq: BM, Name: Open, dtype: float64
Assuming you have downloaded data from Yahoo:
> import pandas_datareader.data as web  # pandas.io.data was removed; use the pandas-datareader package
> import datetime
> start = datetime.datetime(2016,1,1)
> end = datetime.datetime(2016,5,1)
> df = web.DataReader("AAPL", "yahoo", start, end)
You simply pick the month end and start rows with:
df[df.index.is_month_end]
df[df.index.is_month_start]
If you want to access a specific row, such as the first of the selected month-start rows, you simply do:
df[df.index.is_month_start].iloc[0]  # .ix was removed in modern pandas; use .iloc
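Note that is_month_start / is_month_end only match rows whose date is literally the first or last calendar day of the month, so a month whose 1st fell on a holiday yields no row at all. A hedged alternative (my addition, not from the original answer) is to take the first and last available row of each month:
first_rows = df.groupby([df.index.year, df.index.month]).head(1)  # first trading day per month
last_rows = df.groupby([df.index.year, df.index.month]).tail(1)   # last trading day per month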