pd.Grouper to shift the bin intervals - python

In pandas 1.2 I would like to adjust origin and offset, if possible, so that the bins in the example below fall on exact yyyy-05-15 and yyyy-11-15 six-month intervals. See the documentation for pd.Grouper.
Here is my attempt with the following code:
import pandas as pd

n_rows = 600
df = pd.DataFrame({'date': pd.date_range(periods=n_rows, end='2020-04-15'),
                   'current_start_date': '2020-11-15',
                   'current_end_date': '2021-05-15',
                   'a': range(n_rows)})
df[['current_start_date', 'current_end_date']] = df[['current_start_date',
                                                     'current_end_date']].apply(pd.to_datetime)
end = df['current_end_date'].iloc[0]
df.groupby(pd.Grouper(freq='6M', key='date', origin=end, label='right', closed='right')).sum()
produces
Out[3]:
                a
date
2018-08-31     21
2019-02-28  17557
2019-08-31  51428
2020-02-29  84175
2020-08-31  26519
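A possible workaround (a minimal sketch of mine, not from the original post): origin is only honored for tick frequencies such as days or hours, so for month-anchored bins you can build the yyyy-05-15/yyyy-11-15 edges explicitly and group with pd.cut instead:
# Sketch: explicit half-year edges anchored on May 15 / Nov 15
edges = pd.date_range('2018-05-15', '2020-11-15', freq=pd.DateOffset(months=6))
binned = pd.cut(df['date'], bins=edges, labels=edges[1:], right=True)
df.groupby(binned)['a'].sum()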

Related

Time elapsed since first log for each user

I'm trying to calculate the time difference between all the logs of a user and the first log of that same user. There are users with several logs.
The dataframe looks like this:
             ID                 DATE
16  00000021601  2022-08-23 17:12:04
20  00000021601  2022-08-23 17:12:04
21  00000031313  2022-10-22 11:16:57
22  00000031313  2022-10-22 12:16:44
23  00000031313  2022-10-22 14:39:07
24  00000065137  2022-05-06 11:51:33
25  00000065137  2022-05-06 11:51:33
I know that I could do df['DELTA'] = df.groupby('ID')['DATE'].shift(-1) - df['DATE'] to get the difference between consecutive dates for each user, but since something like iat[0] doesn't work in this case, I don't know how to get the difference relative to the first date.
You can try this code:
import pandas as pd

dates = ['2022-08-23 17:12:04',
         '2022-08-23 17:12:04',
         '2022-10-22 11:16:57',
         '2022-10-22 12:16:44',
         '2022-10-22 14:39:07',
         '2022-05-06 11:51:33',
         '2022-05-06 11:51:33']
ids = [1, 1, 1, 2, 2, 2, 2]
df = pd.DataFrame({'id': ids, 'dates': dates})
df['dates'] = pd.to_datetime(df['dates'])

# Subtract each group's first date from every date in that group
df.groupby('id').apply(lambda x: x['dates'] - x['dates'].iloc[0])
Out:
id
1   0      0 days 00:00:00
    1      0 days 00:00:00
    2     59 days 18:04:53
2   3      0 days 00:00:00
    4      0 days 02:22:23
    5   -170 days +23:34:49
    6   -170 days +23:34:49
Name: dates, dtype: timedelta64[ns]
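As a vectorized alternative (my addition, not from the original answer), transform('first') broadcasts each group's first date back onto every row, so no apply is needed:
# Difference of each log from the first log of the same id
df['delta'] = df['dates'] - df.groupby('id')['dates'].transform('first')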
If your dataframe is large and apply takes a long time, you can try parallel-pandas. It's very simple:
import pandas as pd
from parallel_pandas import ParallelPandas

ParallelPandas.initialize(n_cpu=8)

dates = ['2022-08-23 17:12:04',
         '2022-08-23 17:12:04',
         '2022-10-22 11:16:57',
         '2022-10-22 12:16:44',
         '2022-10-22 14:39:07',
         '2022-05-06 11:51:33',
         '2022-05-06 11:51:33']
ids = [1, 1, 1, 2, 2, 2, 2]
df = pd.DataFrame({'id': ids, 'dates': dates})
df['dates'] = pd.to_datetime(df['dates'])

# p_apply is the parallel analogue of the apply method
df.groupby('id').p_apply(lambda x: x['dates'] - x['dates'].iloc[0])
It can be 5-10 times faster.

Pandas: Average values in groups of X days

I want to average the values of AAPL.High in groups of 10 days (JAN/01 to JAN/10), using day number 10 as the reference.
import pandas as pd
df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/finance-charts-apple.csv')
The idea of the code is approximately:
df1['demand'] = df1.groupby(['supplier_name', 'date'])['difference'].transform('mean').fillna(0)
Simple case: define the index as the dates, then just use resample().
import numpy as np
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/finance-charts-apple.csv')
df.index = pd.to_datetime(df["Date"])
df.resample("10d").agg({"AAPL.High": np.mean})
output

             AAPL.High
Date
2015-02-17  130.657501
2015-02-27  129.675001
2015-03-09  126.661251
2015-03-19  127.134283
2015-03-29  126.533333
...                ...
2017-01-07  119.532001
2017-01-17  120.841248
2017-01-27  125.740000
2017-02-06  133.172500
2017-02-16  135.899994
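Equivalently (a small stylistic variant, not from the original answer), selecting the column before aggregating avoids the numpy dependency:
# Same 10-day bins, same means
df.resample("10d")["AAPL.High"].mean()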

Generate random timeseries data with dates

I am trying to generate random data (integers) with dates so that I can practice pandas data analysis commands on it and plot time series graphs.
                temp     depth  acceleration
2019-01-1  -0.218062 -1.215978     -1.674843
2019-02-1  -0.465085 -0.188715      0.241956
2019-03-1  -1.464794 -1.354594      0.635196
2019-04-1   0.103813  0.194349     -0.450041
2019-05-1   0.437921  0.073829      1.346550
Is there any random dataframe generator that can generate something like this with each date having a gap of one month?
You can either use pandas.util.testing (note that this module is deprecated in newer pandas versions):
import pandas.util.testing as testing
import numpy as np

np.random.seed(1)
# Set the rows and columns of the desired data
testing.N, testing.K = 5, 3
print(testing.makeTimeDataFrame(freq='MS'))
>>>
                   A         B         C
2000-01-01 -0.488392  0.429949 -0.723245
2000-02-01  1.247192 -0.513568 -0.512677
2000-03-01  0.293828  0.284909  1.190453
2000-04-01 -0.326079 -1.274735 -0.008266
2000-05-01 -0.001980  0.745803  1.519243
Or, if you need more control over the random values being generated, you can use something like:
import numpy as np
import pandas as pd

np.random.seed(1)
rows, cols = 5, 3
# You can use other random functions to generate values with constraints
data = np.random.rand(rows, cols)
# freq='MS' sets a monthly frequency anchored on day 1; use 'T' for minutes and so on
tidx = pd.date_range('2019-01-01', periods=rows, freq='MS')
data_frame = pd.DataFrame(data, columns=['a', 'b', 'c'], index=tidx)
print(data_frame)
>>>
                   a         b         c
2019-01-01  0.992856  0.217750  0.538663
2019-02-01  0.189226  0.847022  0.156730
2019-03-01  0.572417  0.722094  0.868219
2019-04-01  0.023791  0.653147  0.857148
2019-05-01  0.729236  0.076817  0.743955
Use numpy.random.rand or numpy.random.randint functions with DataFrame constructor:
import numpy as np
import pandas as pd

np.random.seed(2019)
N = 10
rng = pd.date_range('2019-01-01', freq='MS', periods=N)
df = pd.DataFrame(np.random.rand(N, 3), columns=['temp', 'depth', 'acceleration'], index=rng)
print(df)
                temp     depth  acceleration
2019-01-01  0.903482  0.393081      0.623970
2019-02-01  0.637877  0.880499      0.299172
2019-03-01  0.702198  0.903206      0.881382
2019-04-01  0.405750  0.452447      0.267070
2019-05-01  0.162865  0.889215      0.148476
2019-06-01  0.984723  0.032361      0.515351
2019-07-01  0.201129  0.886011      0.513620
2019-08-01  0.578302  0.299283      0.837197
2019-09-01  0.526650  0.104844      0.278129
2019-10-01  0.046595  0.509076      0.472426
If you need integers:
np.random.seed(2019)
N = 10
rng = pd.date_range('2019-01-01', freq='MS', periods=N)
df = pd.DataFrame(np.random.randint(20, size=(N, 3)),
                  columns=['temp', 'depth', 'acceleration'],
                  index=rng)
print(df)
            temp  depth  acceleration
2019-01-01     8     18             5
2019-02-01    15     12            10
2019-03-01    16     16             7
2019-04-01     5     19            12
2019-05-01    16     18             5
2019-06-01    16     15             1
2019-07-01    14     12            10
2019-08-01     0     11            18
2019-09-01    15     19             1
2019-10-01     3     16            18

Wrong labels when plotting a time series pandas dataframe with matplotlib

I am working with a dataframe containing one week of data.
                              y
ds
2017-08-31 10:15:00    1.000000
2017-08-31 10:20:00    1.049107
2017-08-31 10:25:00    1.098214
...
2017-09-07 10:05:00   99.901786
2017-09-07 10:10:00   99.950893
2017-09-07 10:15:00  100.000000
I create a new index by combining the weekday and time, i.e.:
                    y
dayIndex
4 - 10:15    1.000000
4 - 10:20    1.049107
4 - 10:25    1.098214
...
4 - 10:05   99.901786
4 - 10:10   99.950893
4 - 10:15  100.000000
The plot of this data initially looks correct: the labels reflect the data in the dataframe. However, when zooming in, the labels no longer correspond to their original values.
What is causing this behavior?
Here is the code to reproduce this:
import datetime
import numpy as np
import pandas as pd

dtnow = datetime.datetime.now()
dindex = pd.date_range(dtnow, dtnow + datetime.timedelta(7), freq='5T')
data = np.linspace(1, 100, num=len(dindex))
df = pd.DataFrame({'ds': dindex, 'y': data})
df = df.set_index('ds')
df = df.resample('5T').mean()
df['dayIndex'] = df.index.strftime('%w - %H:%M')
df = df.set_index('dayIndex')
df.plot()
"What is causing this behavior?"
The formatter of an axis of a pandas dates plot is a matplotlib.ticker.FixedFormatter (see e.g. print(plt.gca().xaxis.get_major_formatter())). "Fixed" means that it formats the i-th tick (if shown) with some constant string.
When zooming or panning, you shift the tick locations, but not the format strings.
In short: a pandas date plot may not be the best choice for interactive plots.
Solution
A solution is usually to use matplotlib formatters directly. This requires the dates to be datetime objects (which can be ensured by assigning df.index = df.index.to_pydatetime()).
import datetime
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates

dtnow = datetime.datetime.now()
dindex = pd.date_range(dtnow, dtnow + datetime.timedelta(7), freq='110T')
data = np.linspace(1, 100, num=len(dindex))
df = pd.DataFrame({'ds': dindex, 'y': data})
df = df.set_index('ds')
# Convert the index to python datetimes so matplotlib handles the dates itself
df.index = df.index.to_pydatetime()
df.plot(marker="o")
plt.gca().xaxis.set_major_formatter(matplotlib.dates.DateFormatter('%w - %H:%M'))
plt.show()
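Another option worth noting (my addition, not from the original answer): instead of converting the index, pandas' x_compat flag makes the plot use matplotlib's own date units, so the matplotlib formatter behaves when zooming:
# Assumes df still has its original DatetimeIndex 'ds';
# x_compat=True disables pandas' custom date ticking in favor of matplotlib's
df.plot(marker="o", x_compat=True)
plt.gca().xaxis.set_major_formatter(matplotlib.dates.DateFormatter('%w - %H:%M'))
plt.show()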

Resample Daily Data to Monthly with Pandas (date formatting)

I am trying to resample some data from daily to monthly in a pandas DataFrame. I am new to pandas and maybe I need to format the date and time first before I can do this, but I am not finding a good tutorial on the correct way to work with imported time series data. Everything I find automatically imports data from Yahoo or Quandl.
Here is what I have in my DataFrame:
dataframe segment screenshot
Here is the code I used to create my DataFrame:
import pandas as pd

# Import excel file into a pandas DataFrame
df = pd.read_excel(open('2016_forex_daily_returns.xlsx', 'rb'), sheetname='Sheet 1')
# Calculate the daily returns
df['daily_ret'] = df['Equity'].pct_change()
# Assume an average annual risk-free rate over the period of 5%
df['excess_daily_ret'] = df['daily_ret'] - 0.05/252
Can someone help me understand what I need to do with the "Date" and "Time" columns in my DataFrame so I can resample?
To create the DataFrame, it is possible to use:
df = pd.read_excel('2016_forex_daily_returns.xlsx', sheetname='Sheet 1')
print(df)

        Date      Time  Equity
0 2016-01-03  22:16:22  300.38
1 2016-01-04  22:16:00  300.65
2 2016-01-05  14:26:02  301.65
3 2016-01-06  19:08:13  302.10
4 2016-01-07  18:39:00  302.55
5 2016-01-08  22:16:04  308.24
6 2016-01-11  02:49:39  306.69
7 2016-01-14  15:46:39  307.93
8 2016-01-19  15:56:31  308.18
I think you can first cast the Date column with to_datetime and then use resample with some aggregating functions like sum or mean:
df.Date = pd.to_datetime(df.Date)

df1 = df.resample('M', on='Date').sum()
print(df1)
             Equity  excess_daily_ret
Date
2016-01-31  2738.37          0.024252

df2 = df.resample('M', on='Date').mean()
print(df2)
                Equity  excess_daily_ret
Date
2016-01-31  304.263333          0.003032

df3 = df.set_index('Date').resample('M').mean()
print(df3)
                Equity  excess_daily_ret
Date
2016-01-31  304.263333          0.003032
To resample from daily data to monthly, you can use the resample method. Specifically for daily returns, the example below demonstrates a possible solution.
The following data is taken from an analysis performed by AQR. It represents the market's daily returns for May 2019. The following code may be used to construct the data as a pd.DataFrame.
import numpy as np
import pandas as pd

dates = pd.DatetimeIndex(['2019-05-01', '2019-05-02', '2019-05-03', '2019-05-06',
                          '2019-05-07', '2019-05-08', '2019-05-09', '2019-05-10',
                          '2019-05-13', '2019-05-14', '2019-05-15', '2019-05-16',
                          '2019-05-17', '2019-05-20', '2019-05-21', '2019-05-22',
                          '2019-05-23', '2019-05-24', '2019-05-27', '2019-05-28',
                          '2019-05-29', '2019-05-30', '2019-05-31'],
                         dtype='datetime64[ns]', name='DATE', freq=None)
returns = np.array([-7.73787813e-03, -1.73277604e-03,  1.09124031e-02, -3.80437796e-03,
                    -1.66513456e-02, -1.67262934e-03, -2.77427734e-03,  4.01713274e-03,
                    -2.50407102e-02,  9.23270367e-03,  5.41897568e-03,  8.65419524e-03,
                    -6.83456209e-03, -6.54787106e-03,  9.04322511e-03, -4.05811322e-03,
                    -1.33152640e-02,  2.73398876e-03, -9.52000000e-05, -7.91438809e-03,
                    -7.16881982e-03,  1.19255102e-03, -1.24209547e-02])
daily_returns = pd.DataFrame(index=dates, data=returns, columns=["returns"])
Assuming you don't have daily price data, you can resample from daily returns to monthly returns using the following code.
>>> daily_returns.resample("M").apply(lambda x: ((x + 1).cumprod() - 1).last("D"))
-0.06532
If you refer to their monthly dataset, this confirms that the market return for May 2019 was approximately -6.5%, or -0.06532.
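A slightly shorter equivalent (my variant, not from the original answer) compounds each month's returns directly:
# (1 + r).prod() - 1 compounds the daily returns within each month
monthly = daily_returns["returns"].resample("M").apply(lambda r: (1 + r).prod() - 1)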
First, concatenate the 'Date' and 'Time' columns with a space in between, then convert the result to datetime using pd.to_datetime().
df = pd.read_excel('2016_forex_daily_returns.xlsx', sheetname='Sheet 1')
print(df)

        Date      Time  Equity
0 2016-01-03  22:16:22  300.38
1 2016-01-04  22:16:00  300.65
2 2016-01-05  14:26:02  301.65
3 2016-01-06  19:08:13  302.10
4 2016-01-07  18:39:00  302.55
5 2016-01-08  22:16:04  308.24
6 2016-01-11  02:49:39  306.69
7 2016-01-14  15:46:39  307.93
8 2016-01-19  15:56:31  308.18

df = df.drop(['Date', 'Time'], axis='columns').set_index(pd.to_datetime(df.Date + ' ' + df.Time))
df.index.name = 'Date/Time'
print(df)

                     Equity
Date/Time
2016-01-03 22:16:22  300.38
2016-01-04 22:16:00  300.65
2016-01-05 14:26:02  301.65
2016-01-06 19:08:13  302.10
2016-01-07 18:39:00  302.55
2016-01-08 22:16:04  308.24
2016-01-11 02:49:39  306.69
2016-01-14 15:46:39  307.93
2016-01-19 15:56:31  308.18
Now you can resample to any format you desire.
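For example (my sketch, not part of the original answer), a month-end mean over the combined Date/Time index:
monthly = df.resample('M').mean()
print(monthly)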
I have created a random DataFrame similar to yours here:
import numpy as np
import pandas as pd

# pd.datetime was removed in newer pandas; pd.Timestamp.today() works the same here
dates = pd.date_range(end=pd.Timestamp.today(), periods=1800)
counts = np.random.randint(0, 10000, size=1800)
df = pd.DataFrame({'dates': dates, 'counts': counts}).set_index('dates')
Here is the procedure to aggregate the sum of counts for each week, as an example:
df['week'] = df.index.week
df['year'] = df.index.year
target_df = df.groupby(['year', 'week']).agg({'counts': np.sum})
Where the output of target_df is:
counts
year week
2015 3 29877
4 36859
5 36872
6 36899
7 37769
. . .
. . .
. . .
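One caveat (my note, not from the original answer): DatetimeIndex.week is deprecated in recent pandas releases; isocalendar() is the documented replacement:
# isocalendar() returns a DataFrame with year, week and day columns
iso = df.index.isocalendar()
df['week'] = iso['week'].to_numpy()
df['year'] = iso['year'].to_numpy()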
