Task:
Calculate the frequency of each ID for each month of 2021
Frequency formula: activity period (length of time between the last activity and the first activity) / (number of active days - 1)
e.g. ID 1, month 2: activity period (2021-02-23 - 2021-02-18 = 5 days) / (3 active days - 1) → frequency = 2.5
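A quick sanity check of that arithmetic (a minimal sketch, assuming only pandas is available; the variable names are illustrative):
import pandas as pd

# activity period for ID 1 in month 2: last activity minus first activity
period = pd.Timestamp('2021-02-23') - pd.Timestamp('2021-02-18')  # 5 days
active_days = 3
print(period / pd.Timedelta(days=active_days - 1))  # 2.5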
Sample:
times = [
'2021-02-18',
'2021-02-22',
'2021-02-23',
'2021-04-23',
'2021-01-18',
'2021-01-19',
'2021-01-20',
'2021-01-03',
'2021-02-04',
'2021-02-04'
]
id = [1, 1, 1, 1, 44, 44, 44, 46, 46, 46]
df = pd.DataFrame({'ID':id, 'Date': pd.to_datetime(times)})
df = df.reset_index(drop=True)
print(df)
ID Date
0 1 2021-02-18
1 1 2021-02-22
2 1 2021-02-23
3 1 2021-04-23
4 44 2021-01-18
5 44 2021-01-19
6 44 2021-01-20
7 46 2021-01-03
8 46 2021-02-04
9 46 2021-02-04
Desired Output:
If the frequency is negative, it should be set to 0.
id 01_2021 02_2021 03_2021 04_2021
0 1 0 2.5 0 0
1 44 1 0 0 0
2 46 0 0 0 0
Try a pivot_table with a custom aggfunc:
# Create Columns For Later
dr = pd.date_range(start=df['Date'].min(),
                   end=df['Date'].max() + pd.offsets.MonthBegin(1),
                   freq='M').map(lambda dt: dt.strftime('%m_%Y'))
new_df = (
    df.pivot_table(
        index='ID',
        # Columns are dates in MM_YYYY format
        columns=df['Date'].dt.strftime('%m_%Y'),
        # Custom agg function; max(1, len(x) - 1) prevents division by zero
        aggfunc=lambda x: (x.max() - x.min()) /
                          pd.offsets.Day(max(1, len(x) - 1))
    )
    # Fix axis names and column levels
    .droplevel(0, axis=1)
    .rename_axis(None, axis=1)
    # Reindex to include every month from min to max date
    .reindex(dr, axis=1)
    # Clip to exclude negatives
    .clip(lower=0)
    # Fill NaN with 0
    .fillna(0)
    # Reset index
    .reset_index()
)
print(new_df)
new_df:
ID 01_2021 02_2021 03_2021 04_2021
0 1 0.0 2.5 0.0 0.0
1 44 1.0 0.0 0.0 0.0
2 46 0.0 0.0 0.0 0.0
You will need to pivot the table, but first, if you only want the month and year of the date, you need to transform it (the format below matches the MM_YYYY columns in the desired output):
import numpy as np

df['Date'] = df.Date.map(lambda s: '{:02d}_{}'.format(s.month, s.year))
df['counts'] = 1
df_new = pd.pivot_table(df, index=['ID'],
                        columns=['Date'], aggfunc=np.sum)
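If an ID has no activity in a month, the pivot leaves NaN there; a small follow-up sketch keeps the table dense with zeros, as in the desired output:
df_new = df_new.fillna(0)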
I have this data frame:
id date count
1 8/31/22 1
1 9/1/22 2
1 9/2/22 8
1 9/3/22 0
1 9/4/22 3
1 9/5/22 5
1 9/6/22 1
1 9/7/22 6
1 9/8/22 5
1 9/9/22 7
1 9/10/22 1
2 8/31/22 0
2 9/1/22 2
2 9/2/22 0
2 9/3/22 5
2 9/4/22 1
2 9/5/22 6
2 9/6/22 1
2 9/7/22 1
2 9/8/22 2
2 9/9/22 2
2 9/10/22 0
I want to aggregate the count by id and date to get the sum of counts. Details:
Date: all counts in a week should be aggregated on Saturday. A week starts on Sunday and ends on Saturday. The time period (the first and the last day of counts) is the same for all ids.
The desired output is given below:
id date count
1 9/3/22 11
1 9/10/22 28
2 9/3/22 7
2 9/10/22 13
I already have the following code for this, and it works, but it is not efficient: it takes a long time to run on a large database. I am looking for a much faster and more efficient way to get the output:
new_df['day_name'] = new_df['date'].dt.day_name()
df_week_count = pd.DataFrame(columns=['id', 'date', 'count'])
for id in ids:
    # make a dataframe for each id
    df_id = new_df.loc[new_df['id'] == id]
    df_id.reset_index(drop=True, inplace=True)
    # find Saturday indices
    saturday_indices = df_id.loc[df_id['day_name'] == 'Saturday'].index
    j = 0
    sat_index = 0
    while(j < len(df_id)):
        # find sum of count between j and saturday_indices[sat_index]
        sum_count = df_id.loc[j:saturday_indices[sat_index], 'count'].sum()
        # add id, date, sum_count to df_week_count
        temp_df = pd.DataFrame([[id, df_id.loc[saturday_indices[sat_index], 'date'], sum_count]], columns=['id', 'date', 'count'])
        df_week_count = pd.concat([df_week_count, temp_df], ignore_index=True)
        j = saturday_indices[sat_index] + 1
        sat_index += 1
        if sat_index >= len(saturday_indices):
            break
    if(j < len(df_id)):
        sum_count = df_id.loc[j:, 'count'].sum()
        temp_df = pd.DataFrame([[id, df_id.loc[len(df_id) - 1, 'date'], sum_count]], columns=['id', 'date', 'count'])
        df_week_count = pd.concat([df_week_count, temp_df], ignore_index=True)
df_final = df_week_count.copy(deep=True)
Create a grouping factor from the dates.
week = pd.to_datetime(df['date'].to_numpy()).strftime('%U %y')
df.groupby(['id',week]).agg({'date':max, 'count':sum}).reset_index()
id level_1 date count
0 1 35 22 9/3/22 11
1 1 36 22 9/9/22 28
2 2 35 22 9/3/22 7
3 2 36 22 9/9/22 13
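A vectorized alternative sketch (assuming df['date'] has been parsed with pd.to_datetime) is to let pandas build the Sunday-to-Saturday bins itself via pd.Grouper with the 'W-SAT' frequency, whose bin labels fall on the Saturdays:
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y')
out = (df.groupby(['id', pd.Grouper(key='date', freq='W-SAT')])['count']
         .sum()
         .reset_index())
This avoids the explicit loop and labels each week by its Saturday, matching the desired output.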
I tried to understand as much as I can :)
Here is my process:
from io import StringIO

# reading data ('data' holds the question's table pasted as a string)
df = pd.read_csv(StringIO(data), sep=' ')
# data type fix
df['date'] = pd.to_datetime(df['date'])
# initial grouping
df = df.groupby(['id', 'date'])['count'].sum().to_frame().reset_index()
df.sort_values(by=['date', 'id'], inplace=True)
df.reset_index(drop=True, inplace=True)
# getting name of the day
df['day_name'] = df.date.dt.day_name()
# getting week number
df['week'] = df.date.dt.isocalendar().week
# adjusting week number to make saturday the last day of the week
df.loc[df.day_name == 'Sunday','week'] = df.loc[df.day_name == 'Sunday', 'week'] + 1
What I think you are looking for:
df.groupby(['id','week']).agg(count=('count','sum'), date=('date','max')).reset_index()
   id  week  count                date
0   1    35     11  2022-09-03 00:00:00
1   1    36     28  2022-09-10 00:00:00
2   2    35      7  2022-09-03 00:00:00
3   2    36     13  2022-09-10 00:00:00
What is the best way to find, for each row, the date of the first later row with a larger value? For example, I have this dataframe:
import pandas as pd
data = [[20200101, 10], [20200102, 16], [20200103, 14], [20200104, 18]]
df = pd.DataFrame(data, columns=['date', 'value'])
print(df)
date value
0 20200101 10
1 20200102 16
2 20200103 14
3 20200104 18
I need to get, for each row, the date of the first later row with a larger value:
date value largest_value_date
0 20200101 10 20200102
1 20200102 16 20200104
2 20200103 14 20200104
3 20200104 18 0
Of course I tried it with a for loop, but on big data it's very slow:
df['largest_value_date'] = 0
for i in range(0, len(df)):
    date = df['date'].iloc[i]
    value = df['value'].iloc[i]
    largestDate = df[(df['date'] > date) & (df['value'] > value)]
    if len(largestDate) > 0:
        df.loc[i, 'largest_value_date'] = largestDate['date'].iloc[0]
print(df)
date value largest_value_date
0 20200101 10 20200102
1 20200102 16 20200104
2 20200103 14 20200104
3 20200104 18 0
We can speed up the whole process with a numpy broadcast and idxmax: for each row, get the index of the first later row whose value is greater, then assign the corresponding date back.
import numpy as np

s = df['value'].values
# entry (i, j) of the upper-triangular matrix is s[j] - s[i]; idxmax finds the first j with a larger value
idx = pd.DataFrame(np.triu(s - s[:, None])).gt(0).idxmax(1)
df['new'] = df['date'].reindex(idx.replace(0, -1)).values
df
Out[158]:
date value new
0 20200101 10 20200102.0
1 20200102 16 20200104.0
2 20200103 14 20200104.0
3 20200104 18 NaN
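To match the desired output exactly (0 instead of NaN and plain integer dates), a small follow-up on the 'new' column produced above should do it (a sketch):
df['new'] = df['new'].fillna(0).astype(int)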
So I have the following dataframe:
Period group ID
20130101 A 10
20130101 A 20
20130301 A 20
20140101 A 20
20140301 A 30
20140401 A 40
20130101 B 11
20130201 B 21
20130401 B 31
20140401 B 41
20140501 B 51
I need to count how many different IDs there are by group within the last year. So my desired output would look like this:
Period group num_ids_last_year
20130101 A 2 # ID 10 and 20 in the last year
20130301 A 2
20140101 A 2
20140301 A 2 # ID 30 enters, ID 10 leaves
20140401 A 3 # ID 40 enters
20130101 B 1
20130201 B 2
20130401 B 3
20140401 B 2 # ID 11 and 21 leave
20140501 B 2 # ID 31 leaves, ID 51 enters
Period is in datetime format. I tried many things along the lines of:
df.groupby(['group','Period'])['ID'].nunique() # Get number of IDs by group in a given period.
df.groupby(['group'])['ID'].nunique() # Get total number of IDs by group.
df.set_index('Period').groupby('group')['ID'].rolling(window=1, freq='Y').nunique()
But the last one isn't even possible. Is there any straightforward way to do this? I'm thinking maybe some kind of combination of cumcount() and pd.DateOffset, or maybe ge(df.Period - dt.timedelta(365)), but I can't find the answer.
Thanks.
Edit: added the fact that I can find more than one ID in a given Period
Looking at your data structure, I am guessing you have MANY duplicates, so start by dropping them; drop_duplicates tends to be fast.
I am assuming that the df['Period'] column is of dtype datetime64[ns].
from dateutil.relativedelta import relativedelta

df = df.drop_duplicates()
results = dict()
for start in df['Period'].drop_duplicates():
    end = start.date() - relativedelta(years=1)
    screen = (df.Period <= start) & (df.Period >= end)  # screen for 1 year of data
    singles = df.loc[screen, ['group', 'ID']].drop_duplicates()  # one row per (group, ID) within that year
    x = singles.groupby('group').count()
    results[start] = x
results = pd.concat(results, axis=0)
results
              ID
      group
2013-01-01 A   2
           B   1
2013-02-01 A   2
           B   2
2013-03-01 A   2
           B   2
2013-04-01 A   2
           B   3
2014-01-01 A   2
           B   3
2014-03-01 A   2
           B   1
2014-04-01 A   3
           B   2
2014-05-01 A   3
           B   2
Is that any faster?
P.S. If df['Period'] is not a datetime:
df['Period'] = pd.to_datetime(df['Period'],format='%Y%m%d', errors='ignore')
Here is a solution using groupby and rolling. Note: your desired output counts a year from YYYY0101 to the next year's YYYY0101, so you need a rolling window of 366D instead of 365D.
import numpy as np

df['Period'] = pd.to_datetime(df.Period, format='%Y%m%d')
df = df.set_index('Period')
df_final = (df.groupby('group')['ID'].rolling(window='366D')
              .apply(lambda x: np.unique(x).size, raw=True)
              .reset_index(name='ID_count')
              .drop_duplicates(['group','Period'], keep='last'))
Out[218]:
group Period ID_count
1 A 2013-01-01 2.0
2 A 2013-03-01 2.0
3 A 2014-01-01 2.0
4 A 2014-03-01 2.0
5 A 2014-04-01 3.0
6 B 2013-01-01 1.0
7 B 2013-02-01 2.0
8 B 2013-04-01 3.0
9 B 2014-04-01 2.0
10 B 2014-05-01 2.0
Note: on 18M+ rows, I don't think this solution will finish within 10 minutes; I would expect it to take about 30 minutes.
from dateutil.relativedelta import relativedelta
df.sort_values(by=['Period'], inplace=True) # if not already sorted
# create new output df
df1 = (df.groupby(['Period','group'])['ID']
         .apply(lambda x: list(x))
         .reset_index())
df1['num_ids_last_year'] = df1.apply(lambda x: len(set(df1.loc[(df1['Period'] >= x['Period']-relativedelta(years=1)) & (df1['Period'] <= x['Period']) & (df1['group'] == x['group'])].ID.apply(pd.Series).stack())), axis=1)
df1.sort_values(by=['group'], inplace=True)
df1.drop('ID', axis=1, inplace=True)
df1 = df1.reset_index(drop=True)
New to Python and coding in general here, so this should be pretty basic for most of you.
I basically created this dataframe with a datetime index. Here's the dataframe:
df = pd.date_range(start='2018-01-01', end='2019-12-31', freq='D')
I would now like to add a new variable to my df called "vacation" with a value of 1 if the date is between 2018-06-24 and 2018-08-24 and value of 0 if it's not between those dates. How can I go about doing this?
I've created a variable with a range of vacation but I'm not sure how to put these two together along with creating a new column for "vacation" in my dataframe.
vacation = pd.date_range(start = '2018-06-24', end='2018-08-24')
Thanks in advance.
First, pd.date_range(start='2018-01-01', end='2019-12-31', freq='D') will not create a DataFrame; instead it will create a DatetimeIndex. You can then convert it into a DataFrame by using it as the index or as a separate column.
import numpy as np

# Having it as an index
datetime_index = pd.date_range(start='2018-01-01', end='2019-12-31', freq='D')
df = pd.DataFrame({}, index=datetime_index)
# Using numpy.where() to create the Vacation column
df['Vacation'] = np.where((df.index >= '2018-06-24') & (df.index <= '2018-08-24'), 1, 0)
Or
# Having it as a column
datetime_index = pd.date_range(start='2018-01-01', end='2019-12-31', freq='D')
df = pd.DataFrame({'Date': datetime_index})
# Using numpy.where() to create the Vacation column
df['Vacation'] = np.where((df['Date'] >= '2018-06-24') & (df['Date'] <= '2018-08-24'), 1, 0)
Note: Displaying only the first five rows of the dataframe df.
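As a side note, the same mask can be written with Series.between, which includes both endpoints by default (a sketch assuming the 'Date'-column variant above):
df['Vacation'] = df['Date'].between('2018-06-24', '2018-08-24').astype(int)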
Solution for new DataFrame:
i = pd.date_range(start='2018-01-01', end='2018-08-26', freq='D')
m = (i > '2018-06-24') & (i < '2018-08-24')
df = pd.DataFrame({'vacation': m.astype(int)}, index=i)
Or:
df = pd.DataFrame({'vacation':np.where(m, 1, 0)}, index=i)
print (df)
vacation
2018-01-01 0
2018-01-02 0
2018-01-03 0
2018-01-04 0
2018-01-05 0
...
2018-08-22 1
2018-08-23 1
2018-08-24 0
2018-08-25 0
2018-08-26 0
[238 rows x 1 columns]
Solution for adding a new column to an existing DataFrame:
Create a mask by comparing the DatetimeIndex, chain the comparisons with & (bitwise AND), and convert the result to integer (True to 1 and False to 0), or use numpy.where:
i = pd.date_range(start='2018-01-01', end='2018-08-26', freq='D')
df = pd.DataFrame({'a': 1}, index=i)
m = (df.index > '2018-06-24') & (df.index < '2018-08-24')
df['vacation'] = m.astype(int)
#alternative
#df['vacation'] = np.where(m, 1, 0)
print (df)
a vacation
2018-01-01 1 0
2018-01-02 1 0
2018-01-03 1 0
2018-01-04 1 0
2018-01-05 1 0
.. ...
2018-08-22 1 1
2018-08-23 1 1
2018-08-24 1 0
2018-08-25 1 0
2018-08-26 1 0
[238 rows x 2 columns]
Another solution with DatetimeIndex and DataFrame.loc; the difference is that it includes the 2018-06-24 and 2018-08-24 edge values:
df['vacation'] = 0
df.loc['2018-06-24':'2018-08-24', 'vacation'] = 1
print (df)
a vacation
2018-01-01 1 0
2018-01-02 1 0
2018-01-03 1 0
2018-01-04 1 0
2018-01-05 1 0
.. ...
2018-08-22 1 1
2018-08-23 1 1
2018-08-24 1 1
2018-08-25 1 0
2018-08-26 1 0
[238 rows x 2 columns]
I have a dataframe df1, and I want to calculate the days between two dates given three conditions and create a new column DiffDays with the difference in days.
1) When Yes is 1
2) When values in Value are non-zero
3) Must be UserId specific (perhaps with groupby())
df1 = pd.DataFrame({'Date':['02.01.2017', '03.01.2017', '04.01.2017', '05.01.2017', '01.01.2017', '02.01.2017', '03.01.2017'],
'UserId':[1,1,1,1,2,2,2],
'Value':[0,0,0,100,0,1000,0],
'Yes':[1,0,0,0,1,0,0]})
For example, for UserId 1: when Yes is 1 (on 02.01.2017), calculate the days until Value is non-zero (on 05.01.2017). The result is three days, placed in row 3.
Expected outcome:
Date UserId Value Yes DiffDays
0 02.01.2017 1 0.0 1 0
1 03.01.2017 1 0.0 0.0 0
2 04.01.2017 1 0.0 0.0 0
3 05.01.2017 1 100 0.0 3
4 01.01.2017 2 0.0 1 0
5 02.01.2017 2 1000 0.0 1
6 03.01.2017 2 0.0 0.0 0
I couldn't find anything on Stack Overflow about this, and I'm not sure how to start.
import numpy as np

def dayDiff(groupby):
    # if the group never has Yes == 1 or never has a non-zero Value, return all zeros
    if (not (groupby.Yes == 1).any()) or (not (groupby.Value > 0).any()):
        return np.zeros(groupby.Date.count())
    # first date where Yes == 1 and first date where Value is non-zero
    min_date = groupby[groupby.Yes == 1].Date.iloc[0]
    max_date = groupby[groupby.Value > 0].Date.iloc[0]
    delta = max_date - min_date
    # place the day difference on the rows where Value is non-zero, 0 elsewhere
    return np.where(groupby.Value > 0, delta.days, 0)
df1.Date = pd.to_datetime(df1.Date, dayfirst=True)
DateDiff = df1.groupby('UserId').apply(dayDiff).explode().rename('DateDiff').reset_index(drop=True)
pd.concat([df1, DateDiff], axis=1)
Returns:
Date UserId Value Yes DateDiff
0 2017-01-02 1 0 1 0
1 2017-01-03 1 0 0 0
2 2017-01-04 1 0 0 0
3 2017-01-05 1 100 0 3
4 2017-01-01 2 0 1 0
5 2017-01-02 2 1000 0 1
6 2017-01-03 2 0 0 0
Although this answers your question, the date diff logic is hard to follow, especially when it comes to the placement of the DateDiff values.
Update
pd.Series.explode() was only introduced in pandas 0.25; for those using earlier versions:
df1.Date = pd.to_datetime(df1.Date, dayfirst=True)
DateDiff = (df1
            .groupby('UserId')
            .apply(dayDiff)
            .to_frame()
            .explode(0)
            .reset_index(drop=True)
            .rename(columns={0: 'DateDiff'}))
pd.concat([df1, DateDiff], axis=1)
This will yield the same results.