I'm trying to figure out how to work with a dataframe representing players in a game; the dataframe has unique users and a record for each day a particular user has been active.
I am trying to get the average playtime and average moves for each week of each user's lifetime.
(A week is defined by a user's first record, i.e. if a user's first record is on the 3rd of January, their 1st week starts then and their 2nd week starts on the 10th of January.)
Example
userid date secondsPlayed movesMade
++/acsbP2NFC2BvgG1BzySv5jko= 2016-04-28 413.88188 85
++/acsbP2NFC2BvgG1BzySv5jko= 2016-05-01 82.67343 15
++/acsbP2NFC2BvgG1BzySv5jko= 2016-05-05 236.73809 39
++/acsbP2NFC2BvgG1BzySv5jko= 2016-05-10 112.69112 29
++/acsbP2NFC2BvgG1BzySv5jko= 2016-05-11 211.42790 44
-----------------------------------CONT----------------------------------
++/8ij1h8378h123123koF3oer1 2016-05-05 200.73809 11
++/8ij1h8378h123123koF3oer1 2016-05-10 51.69112 14
++/8ij1h8378h123123koF3oer1 2016-05-14 65.42790 53
The end result for this would be the following table:
userid date secondsPlayed_w movesMade_w
++/acsbP2NFC2BvgG1BzySv5jko= 2016-04-28 496.55531 100
++/acsbP2NFC2BvgG1BzySv5jko= 2016-05-05 236.73809 68
-----------------------------------CONT----------------------------------
++/8ij1h8378h123123koF3oer1 2016-05-05 252.42921 25
++/8ij1h8378h123123koF3oer1 2016-05-12 65.42790 53
Failed attempt #1:
So far I've tried doing a lot of different things, but the most useful dataframe I've managed to create was the following:
df_grouped = df.groupby('userid').apply(lambda x: x.set_index('date').resample('1D').first().fillna(0))
df_result = df_grouped.groupby(level=0)['secondsPlayed'].apply(lambda x: x.rolling(min_periods=1, window=7).mean()).reset_index(name='secondsPlayed_week')
This is a very slow and wasteful computation, but it can nonetheless be used as an intermediate step.
userid date secondsPlayed_week
++/acsbP2NFC2BvgG1BzySv5jko= 2016-04-28 4.138819e+02
++/acsbP2NFC2BvgG1BzySv5jko= 2016-04-29 2.069409e+02
++/acsbP2NFC2BvgG1BzySv5jko= 2016-04-30 1.379606e+02
++/acsbP2NFC2BvgG1BzySv5jko= 2016-05-01 1.241388e+02
++/acsbP2NFC2BvgG1BzySv5jko= 2016-05-02 9.931106e+01
++/acsbP2NFC2BvgG1BzySv5jko= 2016-05-03 8.275922e+01
++/acsbP2NFC2BvgG1BzySv5jko= 2016-05-04 7.093647e+01
++/acsbP2NFC2BvgG1BzySv5jko= 2016-05-05 4.563022e+01
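Continuing from this intermediate result, one possible way to finish is a sketch like the following: every 7th daily row (positions 6, 13, ...) closes out a full week since the user's first record, and multiplying the 7-day mean of zero-filled days by 7 recovers the weekly total (as in the desired output). Partial trailing weeks are dropped in this sketch.
# a sketch: rows 6, 13, ... each cover exactly one week of zero-filled days
weekly = (df_result.groupby('userid', group_keys=False)
                   .apply(lambda x: x.iloc[6::7]))
weekly['secondsPlayed_week'] *= 7  # mean over 7 zero-filled days -> weekly sum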
Failed attempt #2:
df_result = (df
             .reset_index()
             .set_index("date")
             .groupby(pd.Grouper(freq='W'))
             .agg({"userid": "first", "secondsPlayed": "sum", "movesMade": "sum"})
             .reset_index())
This gave me the following dataframe, which has the fault of not being grouped by userid (the NaN problem is easily resolved).
date userid secondsPlayed_w movesMade_w
2016-04-10 +1kexX0Yk2Su639WaRKARcwjq5g= 2.581356e+03 320
2016-04-17 +1kexX0Yk2Su639WaRKARcwjq5g= 4.040738e+03 615
2016-04-24 NaN 0.000000e+00 0
2016-05-01 ++RBPf9KdTK6pTN+lKZHDLCXg10= 1.644130e+05 17453
2016-05-08 ++DndI7do036eqYh9iW7vekAnx0= 3.775905e+05 31997
2016-05-15 ++NjKpr/vyxNCiYcmeFK9qSqD9o= 4.993430e+05 34706
2016-05-22 ++RBPf9KdTK6pTN+lKZHDLCXg10= 3.940408e+05 23779
Immediate thought:
Can this problem be solved with a groupby on two columns? I'm not at all sure how to go about that for this particular problem.
You can create a new week-number column to help the groupby:
df.date = pd.to_datetime(df.date)
# get the week number relative to the first date of each id
df['Newweeknumber'] = (df.date - df.groupby('userid').date.transform('min')).dt.days // 7
df.groupby(['userid', 'Newweeknumber']).agg({"secondsPlayed": "sum", "movesMade": "sum"})
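If you also want the calendar date each week starts on (as in the desired output above), one way is to map the week number back to an offset from the user's first record. A sketch, reusing the Newweeknumber column created above; the week_start name is just for illustration:
first = df.groupby('userid').date.transform('min')
df['week_start'] = first + pd.to_timedelta(df['Newweeknumber'] * 7, unit='D')
(df.groupby(['userid', 'week_start'])
   .agg(secondsPlayed_w=('secondsPlayed', 'sum'), movesMade_w=('movesMade', 'sum'))
   .reset_index())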
Update
Try
df1 = pd.DataFrame(index=pd.date_range('2015-04-24', periods = 50)).assign(value=1)
df2 = pd.DataFrame(index=pd.date_range('2015-04-28', periods = 50)).assign(value=1)
df3 = pd.concat([df1,df2], keys=['A','B'])
df3 = df3.rename_axis(['user','date']).reset_index()
df3.groupby('user').apply(lambda x: x.resample('7D', on='date')[['value']].sum())
Output:
value
user date
A 2015-04-24 7
2015-05-01 7
2015-05-08 7
2015-05-15 7
2015-05-22 7
2015-05-29 7
2015-06-05 7
2015-06-12 1
B 2015-04-28 7
2015-05-05 7
2015-05-12 7
2015-05-19 7
2015-05-26 7
2015-06-02 7
2015-06-09 7
2015-06-16 1
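Applied to the question's frame, the same pattern might look like this (a sketch, assuming date is already datetime; with a '7D' frequency each user's bins are anchored at their first record, as the toy example above shows):
weekly = (df.groupby('userid')
            .apply(lambda x: x.resample('7D', on='date')[['secondsPlayed', 'movesMade']].sum()))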
Related
I have a time series that looks like this:
value date
63.85 2017-01-15
63.95 2017-01-22
63.88 2017-01-29
64.02 2017-02-05
63.84 2017-02-12
62.13 2017-03-05
65.36 2017-03-25
66.45 2017-04-25
And I would like to reverse the order of the rows so they look like this:
value date
66.45 2000-01-01
65.36 2000-02-01
62.13 2000-02-20
63.84 2000-03-12
64.02 2000-03-19
63.88 2000-03-26
63.95 2000-04-02
63.85 2000-04-09
As you can see, the "value" column simply has its rows flipped, but for the date column what I would like to do is keep the same "difference in days" between dates. It doesn't really matter what the start date is, as long as the differences in days are flipped correctly too. In the second dataframe of the example, the start date is 2000-01-01 and the second value is 2000-02-01, which is 31 days later than the first date; this "day difference" of 31 days is the same as that between the last (2017-04-25) and penultimate (2017-03-25) dates of the first dataframe. Likewise for the second (2000-02-01) and third (2000-02-20) values of the second dataframe: the "difference in days" of 20 days is the same as that between the penultimate (2017-03-25) and antepenultimate (2017-03-05) dates of the first dataframe. And so on.
I believe the first step would be to calculate these "day differences", but I would like to know how to do it efficiently. Thank you :)
NumPy has support for this via its datetime and timedelta data types.
First you reverse both columns in your time series as follows:
import pandas as pd
import numpy as np
df2 = df.iloc[::-1]  # reverse the row order
df2
where df is your original time series data and df2 (shown below) is the reversed time series.
value date
7 66.45 2017-04-25
6 65.36 2017-03-25
5 62.13 2017-03-05
4 63.84 2017-02-12
3 64.02 2017-02-05
2 63.88 2017-01-29
1 63.95 2017-01-22
0 63.85 2017-01-15
Next you find the day differences and store them as timedelta objects:
dates_np = np.array(df2.date).astype(np.datetime64)   # convert dates to np.datetime64 objects
timeDeltas = np.insert(abs(np.diff(dates_np)), 0, 0)  # np.insert pads the first row, since np.diff returns n-1 values
d2 = {'value': df2.value, 'day_diff': timeDeltas}     # create new dataframe (df3)
df3 = pd.DataFrame(data=d2)
df3
where df3 (the day differences table) looks like this:
value day_diff
7 66.45 0 days
6 65.36 31 days
5 62.13 20 days
4 63.84 21 days
3 64.02 7 days
2 63.88 7 days
1 63.95 7 days
0 63.85 7 days
Lastly, to get back to dates accumulating from a start date, you do the following:
startDate = np.datetime64('2000-01-01')         # you can change this if you like
df4 = df2.copy()                                # copy column data from df2
df4.date = np.cumsum(df3.day_diff) + startDate  # np.cumsum accumulates the day_diff sum
df4
where df4 (the start date accumulation) looks like this:
value date
7 66.45 2000-01-01
6 65.36 2000-02-01
5 62.13 2000-02-21
4 63.84 2000-03-13
3 64.02 2000-03-20
2 63.88 2000-03-27
1 63.95 2000-04-03
0 63.85 2000-04-10
I noticed there is a 1-day discrepancy with my final table, however this is most likely due to the implementation of timedelta inclusivity/exclusivity.
Here's how I did it:
Creating the DataFrame:
value = [63.85, 63.95, 63.88, 64.02, 63.84, 62.13, 65.36, 66.45]
date = ["2017-01-15", "2017-01-22", "2017-01-29", "2017-02-05", "2017-02-12", "2017-03-05", "2017-03-25", "2017-04-25",]
df = pd.DataFrame({"value": value, "date": date})
Creating a second DataFrame with the values reversed and converting the date column to datetime
new_df = df.astype({'date': 'datetime64[ns]'})
new_df.sort_index(ascending=False, inplace=True, ignore_index=True)
new_df
value date
0 66.45 2017-04-25
1 65.36 2017-03-25
2 62.13 2017-03-05
3 63.84 2017-02-12
4 64.02 2017-02-05
5 63.88 2017-01-29
6 63.95 2017-01-22
7 63.85 2017-01-15
I then used pandas.Series.diff to calculate the time delta between each row and converted those values to absolute values.
time_delta_series = new_df['date'].diff().abs()
time_delta_series
0 NaT
1 31 days
2 20 days
3 21 days
4 7 days
5 7 days
6 7 days
7 7 days
Name: date, dtype: timedelta64[ns]
Then you need to convert those values to a cumulative time delta.
But to use the cumsum() method you need to first remove the missing values (NaT).
time_delta_series = time_delta_series.fillna(pd.Timedelta(seconds=0)).cumsum()
time_delta_series
0 0 days
1 31 days
2 51 days
3 72 days
4 79 days
5 86 days
6 93 days
7 100 days
Name: date, dtype: timedelta64[ns]
Then you can create your starting date and create the date column for the second DataFrame we created before:
start = pd.Timestamp(2000, 1, 1)
new_df['date'] = start + time_delta_series
new_df
value date
0 66.45 2000-01-01
1 65.36 2000-02-01
2 62.13 2000-02-21
3 63.84 2000-03-13
4 64.02 2000-03-20
5 63.88 2000-03-27
6 63.95 2000-04-03
7 63.85 2000-04-10
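For what it's worth, the whole transformation can likely be compressed into a few pandas-only lines (a sketch, assuming df is the original frame with string dates; the start date is arbitrary):
d = pd.to_datetime(df['date'])
gaps = d.diff().dropna().iloc[::-1].reset_index(drop=True)  # day gaps, most recent first
offsets = pd.concat([pd.Series([pd.Timedelta(0)]), gaps], ignore_index=True).cumsum()
out = pd.DataFrame({'value': df['value'].iloc[::-1].to_numpy(),
                    'date': pd.Timestamp('2000-01-01') + offsets})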
Is there a (more) convenient/efficient method to calculate the number of business days between two dates using pandas?
I could do
len(pd.bdate_range(start='2018-12-03',end='2018-12-14'))-1 # minus one only if end date is a business day
but for longer distances between the start and end day this seems rather inefficient.
There are a couple of suggestions on how to use the BDay offset object, but they all seem to refer to the creation of date ranges or something similar.
I am thinking more in terms of a Timedelta object that is represented in business days.
Say I have two series, s1 and s2, containing datetimes. If pandas had something along the lines of
s1.dt.subtract(s2,freq='B')
# giving a new series containing timedeltas where the number of days calculated
# use business days only
would be nice.
(numpy has a busday_count() method. But I would not want to convert my pandas Timestamps to numpy, as this can get messy.)
I think np.busday_count is a good idea here; also, converting to numpy arrays is not necessary:
s1 = pd.Series(pd.date_range(start='05/01/2019',end='05/10/2019'))
s2 = pd.Series(pd.date_range(start='05/04/2019',periods=10, freq='5d'))
s = pd.Series([np.busday_count(a, b) for a, b in zip(s1, s2)])
print (s)
0 3
1 5
2 7
3 10
4 14
5 17
6 19
7 23
8 25
9 27
dtype: int64
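If the Series are long, the Python-level loop can probably be avoided by passing whole arrays cast to day precision, since np.busday_count accepts datetime64[D] arrays (a sketch):
s = pd.Series(np.busday_count(s1.to_numpy().astype('datetime64[D]'),
                              s2.to_numpy().astype('datetime64[D]')))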
import pandas as pd
from xone import calendar

def business_dates(start, end):
    us_cal = calendar.USTradingCalendar()
    kw = dict(start=start, end=end)
    return pd.bdate_range(**kw).drop(us_cal.holidays(**kw))
In [1]: business_dates(start='2018-12-20', end='2018-12-31')
Out[1]: DatetimeIndex(['2018-12-20', '2018-12-21', '2018-12-24', '2018-12-26',
'2018-12-27', '2018-12-28', '2018-12-31'],
dtype='datetime64[ns]', freq=None)
Source: Get business days between start and end date using pandas
# create a dataframe with the dates
df = pd.DataFrame({'dates': pd.date_range(start='05/01/2019', end='05/31/2019')})
# keep only the dates that fall on business days
df[df['dates'].isin(pd.bdate_range(df['dates'].iloc[0], df['dates'].iloc[-1]))]
out[]:
0 2019-05-01
1 2019-05-02
2 2019-05-03
5 2019-05-06
6 2019-05-07
7 2019-05-08
8 2019-05-09
9 2019-05-10
12 2019-05-13
13 2019-05-14
14 2019-05-15
15 2019-05-16
16 2019-05-17
19 2019-05-20
20 2019-05-21
21 2019-05-22
22 2019-05-23
23 2019-05-24
26 2019-05-27
27 2019-05-28
28 2019-05-29
29 2019-05-30
30 2019-05-31
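If you only need to drop weekends (no holiday calendar), a simpler equivalent might be:
df[df['dates'].dt.dayofweek < 5]  # Monday=0 ... Friday=4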
I have a dataframe of transactions. One of the columns is the date (datetime64[ns]). I'm doing a groupby of users (email as id). Something I'm interested in is the variability of the time between orders for each user. So what I'm looking for in the groupby is the standard deviation of the differences between dates (in days) for each user. If a user has two or fewer transactions the answer should be 0. This is some of the dataframe (I changed some things manually):
df
email date
0 cuadros.paolo#gmail.com 2018-05-01 12:29:59
1 rlez_1202#hotmail.com 2018-07-11 13:43:22
2 cuadros.paolo#gmail.com 2018-09-21 12:29:23
3 paola.alvarado#rumah.com.pe 2018-09-01 09:21:43
4 luchosuito#gmail.com 2018-04-30 12:29:30
5 paola.alvarado#rumah.com.pe 2018-03-22 12:29:23
6 davida.alvarado.703#gmail.com 2018-07-21 12:29:17
7 cuadros.paolo#gmail.com 2018-08-11 12:29:41
8 rlez_1202#hotmail.com 2018-05-23 12:29:14
9 luchosuito#gmail.com 2018-06-01 12:29:17
10 jessica26011#hotmail.com 2018-07-18 12:29:20
11 cuadros.paolo#gmail.com 2018-08-21 12:29:40
12 rlez_1202#hotmail.com 2018-10-01 12:29:31
13 paola.alvarado#rumah.com.pe 2018-06-01 12:29:20
14 miluska-paico#hotmail.com 2018-05-21 12:29:18
15 cinthia_leon87#hotmail.com 2018-07-20 12:29:59
I've tried many ways, but still can't get it. Please help.
For sequential differences, which seems to make the most sense given your explanation:
df.sort_values('date').groupby('email').apply(lambda x: x.date.diff().std()).fillna(0)
Output:
email
cinthia_leon87#hotmail.com 0 days 00:00:00
cuadros.paolo#gmail.com 48 days 05:04:12.988006
davida.alvarado.703#gmail.com 0 days 00:00:00
jessica26011#hotmail.com 0 days 00:00:00
luchosuito#gmail.com 0 days 00:00:00
miluska-paico#hotmail.com 0 days 00:00:00
paola.alvarado#rumah.com.pe 14 days 18:10:16.764069
rlez_1202#hotmail.com 23 days 06:17:04.453408
dtype: timedelta64[ns]
.std() is null for groups with only 1 non-null value, and since .diff() reduces the number of non-null observations by 1, this automatically returns NaN for any group with 2 or fewer measurements, which we then fill with 0.
Also just be aware that the default for pandas is to use N-1 degrees of freedom.
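If you prefer the population standard deviation (N degrees of freedom) instead, a sketch with ddof=0:
df.sort_values('date').groupby('email').apply(lambda x: x.date.diff().std(ddof=0)).fillna(0)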
I have a DataFrame df with sporadic daily business day rows (i.e., there is not always a row for every business day.)
For each row in df I want to create a historical resampled mean dfm going back one month at a time. For example, if I have a row for 2018-02-22 then I want rolling means for rows in the following date ranges:
2018-01-23 : 2018-02-22
2017-12-23 : 2018-01-22
2017-11-23 : 2017-12-22
etc.
But I can't see a way to keep this pegged to the particular day of the month using conventional offsets. For example, if I do:
dfm = df.resample('30D').mean()
Then we see two problems:
It references the beginning of the DataFrame. In fact, I can't find a way to force .resample() to peg itself to the end of the DataFrame – even if I have it operate on df_reversed = df.loc[:'2018-02-22'].iloc[::-1]. Is there a way to "peg" the resampling to something other than the earliest date in the DataFrame? (And ideally pegged to each particular row as I run some lambda on the associated historical resampling from each row's date?)
It will drift over time, because not every month is 30 days long. So as I go back in time I will find that the interval 12 "months" prior ends 2017-02-27, not 2017-02-22 like I want.
Knowing that I want to resample by non-overlapping "months," the second problem can be well-defined for month days 29-31: For example, if I ask to resample for '2018-03-31' then the date ranges would end at the end of each preceding month:
2018-03-01 : 2018-03-31
2018-02-01 : 2018-02-28
2018-01-01 : 2018-01-31
etc.
Though again, I don't know: is there a good or easy way to do this in pandas?
tl;dr:
Given something like the following:
someperiods = 20 # this can be a number of days covering many years
somefrequency = '8D' # this can vary from 1D to maybe 10D
rng = pd.date_range('2017-01-03', periods=someperiods, freq=somefrequency)
df = pd.DataFrame({'x': rng.day}, index=rng) # x in practice is exogenous data
from pandas.tseries.offsets import DateOffset
df['MonthPrior'] = df.index - DateOffset(months=1)
Now:
For each row in df: calculate df['PreviousMonthMean'] = rolling average of all df.x in range [df.MonthPrior, df.index). In this example the resulting DataFrame would be:
Index x MonthPrior PreviousMonthMean
2017-01-03 3 2016-12-03 NaN
2017-01-11 11 2016-12-11 3
2017-01-19 19 2016-12-19 7
2017-01-27 27 2016-12-27 11
2017-02-04 4 2017-01-04 19
2017-02-12 12 2017-01-12 16.66666667
2017-02-20 20 2017-01-20 14.33333333
2017-02-28 28 2017-01-28 12
2017-03-08 8 2017-02-08 20
2017-03-16 16 2017-02-16 18.66666667
2017-03-24 24 2017-02-24 17.33333333
2017-04-01 1 2017-03-01 16
2017-04-09 9 2017-03-09 13.66666667
2017-04-17 17 2017-03-17 11.33333333
2017-04-25 25 2017-03-25 9
2017-05-03 3 2017-04-03 17
2017-05-11 11 2017-04-11 15
2017-05-19 19 2017-04-19 13
2017-05-27 27 2017-04-27 11
2017-06-04 4 2017-05-04 19
If we can get that far, then I need to find an efficient way to iterate it so that, for each row in df, I can aggregate consecutive but non-overlapping df['PreviousMonthMean'] values going back one calendar month at a time from the given DatetimeIndex...
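For reference, one way to get that far (a sketch of the PreviousMonthMean column; a row-wise scan, so not fast, but it reproduces the table above):
# average x over the half-open window [MonthPrior, index) for each row
df['PreviousMonthMean'] = [
    df.loc[(df.index >= lo) & (df.index < hi), 'x'].mean()
    for lo, hi in zip(df['MonthPrior'], df.index)
]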
My data looks like below:
id, date, target
1,2016-10-24,22
1,2016-10-25,31
1,2016-10-27,44
1,2016-10-28,12
2,2016-10-21,22
2,2016-10-22,31
2,2016-10-25,44
2,2016-10-27,12
I want to fill in missing dates among id.
For example, the date range of id=1 is 2016-10-24 ~ 2016-10-28, and 2016-10-26 is missing. Moreover, the date range of id=2 is 2016-10-21 ~ 2016-10-27, and 2016-10-23, 2016-10-24 and 2016-10-26 are missing.
I want to fill in the missing dates and fill in the target value as 0.
Therefore, I want my data to be as below:
id, date, target
1,2016-10-24,22
1,2016-10-25,31
1,2016-10-26,0
1,2016-10-27,44
1,2016-10-28,12
2,2016-10-21,22
2,2016-10-22,31
2,2016-10-23,0
2,2016-10-24,0
2,2016-10-25,44
2,2016-10-26,0
2,2016-10-27,12
Can somebody help me?
Thanks in advance.
You can use groupby with resample - the problem is then fillna, so you need asfreq first:
#if necessary convert to datetime
df.date = pd.to_datetime(df.date)
df = df.set_index('date')
df = df.groupby('id').resample('d')['target'].asfreq().fillna(0).astype(int).reset_index()
print (df)
id date target
0 1 2016-10-24 22
1 1 2016-10-25 31
2 1 2016-10-26 0
3 1 2016-10-27 44
4 1 2016-10-28 12
5 2 2016-10-21 22
6 2 2016-10-22 31
7 2 2016-10-23 0
8 2 2016-10-24 0
9 2 2016-10-25 44
10 2 2016-10-26 0
11 2 2016-10-27 12
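An equivalent alternative (a sketch) is to reindex each group against its own daily date range:
df['date'] = pd.to_datetime(df['date'])
out = (df.set_index('date')
         .groupby('id')['target']
         .apply(lambda s: s.reindex(pd.date_range(s.index.min(), s.index.max()), fill_value=0))
         .rename_axis(['id', 'date'])
         .reset_index())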