I have this data frame:
id date count
1 8/31/22 1
1 9/1/22 2
1 9/2/22 8
1 9/3/22 0
1 9/4/22 3
1 9/5/22 5
1 9/6/22 1
1 9/7/22 6
1 9/8/22 5
1 9/9/22 7
1 9/10/22 1
2 8/31/22 0
2 9/1/22 2
2 9/2/22 0
2 9/3/22 5
2 9/4/22 1
2 9/5/22 6
2 9/6/22 1
2 9/7/22 1
2 9/8/22 2
2 9/9/22 2
2 9/10/22 0
I want to aggregate the count by id and date to get the sum of the counts. Details:
Date: all counts in a week should be aggregated on Saturday. A week starts on Sunday and ends on Saturday. The time period (the first and last day of counts) is the same for all ids.
The desired output is given below:
id date count
1 9/3/22 11
1 9/10/22 28
2 9/3/22 7
2 9/10/22 13
I already have the following code for this task. It works, but it is not efficient: it takes a long time to run on a large database. I am looking for a much faster and more efficient way to get the output:
df['day_name'] = df['date'].dt.day_name()
df_week_count = pd.DataFrame(columns=['id', 'date', 'count'])
for id in ids:
    # make a dataframe for each id
    df_id = df.loc[df['id'] == id]
    df_id.reset_index(drop=True, inplace=True)
    # find Saturday indices
    saturday_indices = df_id.loc[df_id['day_name'] == 'Saturday'].index
    j = 0
    sat_index = 0
    while j < len(df_id):
        # find sum of count between j and the next Saturday
        sum_count = df_id.loc[j:saturday_indices[sat_index], 'count'].sum()
        # add id, date, sum_count to df_week_count
        temp_df = pd.DataFrame([[id, df_id.loc[saturday_indices[sat_index], 'date'], sum_count]],
                               columns=['id', 'date', 'count'])
        df_week_count = pd.concat([df_week_count, temp_df], ignore_index=True)
        j = saturday_indices[sat_index] + 1
        sat_index += 1
        if sat_index >= len(saturday_indices):
            break
    # leftover days after the last Saturday
    if j < len(df_id):
        sum_count = df_id.loc[j:, 'count'].sum()
        temp_df = pd.DataFrame([[id, df_id.loc[len(df_id) - 1, 'date'], sum_count]],
                               columns=['id', 'date', 'count'])
        df_week_count = pd.concat([df_week_count, temp_df], ignore_index=True)
df_final = df_week_count.copy(deep=True)
Create a grouping factor from the dates.
week = pd.to_datetime(df['date'].to_numpy()).strftime('%U %y')
df.groupby(['id',week]).agg({'date':max, 'count':sum}).reset_index()
id level_1 date count
0 1 35 22 9/3/22 11
1 1 36 22 9/9/22 28
2 2 35 22 9/3/22 7
3 2 36 22 9/9/22 13
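The 9/9/22 labels above most likely come from taking max over date while it is still a string (lexicographically '9/9/22' > '9/10/22'); with date parsed as datetime the label would be 9/10/22. As an alternative sketch (my own suggestion, not one of the answers here), a week-ending-Saturday Grouper produces the desired output directly, assuming df has columns id, date, count and date is already a datetime:
import pandas as pd

out = (df.groupby(['id', pd.Grouper(key='date', freq='W-SAT')])['count']
         .sum()
         .reset_index())
# expected shape of out:
#    id       date  count
# 0   1 2022-09-03     11
# 1   1 2022-09-10     28
# 2   2 2022-09-03      7
# 3   2 2022-09-10     13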
I tried to understand as much as I can :)
Here is my process:
# reading data
df = pd.read_csv(StringIO(data), sep=' ')
# data type fix
df['date'] = pd.to_datetime(df['date'])
# initial grouping
df = df.groupby(['id', 'date'])['count'].sum().to_frame().reset_index()
df.sort_values(by=['date', 'id'], inplace=True)
df.reset_index(drop=True, inplace=True)
# getting name of the day
df['day_name'] = df.date.dt.day_name()
# getting week number
df['week'] = df.date.dt.isocalendar().week
# adjusting week number to make saturday the last day of the week
df.loc[df.day_name == 'Sunday','week'] = df.loc[df.day_name == 'Sunday', 'week'] + 1
What I think you are looking for:
df.groupby(['id','week']).agg(count=('count','sum'), date=('date','max')).reset_index()
   id  week  count                date
0   1    35     11  2022-09-03 00:00:00
1   1    36     28  2022-09-10 00:00:00
2   2    35      7  2022-09-03 00:00:00
3   2    36     13  2022-09-10 00:00:00
Context
I'd like to create a time series (with pandas) that counts, for each date, the distinct id values whose start and end dates make them active at that date.
For the sake of legibility, this is a simplified version of the problem.
Data
Let's define the Data this way:
df = pd.DataFrame({
'customerId': [
'1', '1', '1', '2', '2'
],
'id': [
'1', '2', '3', '1', '2'
],
'startDate': [
'2000-01', '2000-01', '2000-04', '2000-05', '2000-06',
],
'endDate': [
'2000-08', '2000-02', '2000-07', '2000-07', '2000-08',
],
})
And the period range this way:
period_range = pd.period_range(start='2000-01', end='2000-07', freq='M')
Objectives
For each customerId, there are several distinct id.
The final aim is to get, for each date of the period range and each customerId, the count of distinct id values whose startDate and endDate match the function my_date_predicate.
Simplified definition of my_date_predicate:
unset_date = pd.to_datetime("1900-01")
def my_date_predicate(date, row):
    return row.startDate <= date and \
        (row.endDate == unset_date or row.endDate > date)
Awaited result
I'd like a time series result like this:
date customerId customerCount
0 2000-01 1 2
1 2000-01 2 0
2 2000-02 1 1
3 2000-02 2 0
4 2000-03 1 1
5 2000-03 2 0
6 2000-04 1 2
7 2000-04 2 0
8 2000-05 1 2
9 2000-05 2 1
10 2000-06 1 2
11 2000-06 2 2
12 2000-07 1 1
13 2000-07 2 0
Question
How could I use pandas to get such result?
Here's a solution:
df.startDate = pd.to_datetime(df.startDate)
df.endDate = pd.to_datetime(df.endDate)
df["month"] = df.apply(lambda row: pd.date_range(row["startDate"], row["endDate"], freq="MS", closed = "left"), axis=1)
df = df.explode("month")
period_range = pd.period_range(start='2000-01', end='2000-07', freq='M')
t = pd.DataFrame(period_range.to_timestamp(), columns=["month"])
customers_df = pd.DataFrame(df.customerId.unique(), columns = ["customerId"])
t = pd.merge(t.assign(dummy=1), customers_df.assign(dummy=1), on = "dummy").drop("dummy", axis=1)
t = pd.merge(t, df, on = ["customerId", "month"], how = "left")
t.groupby(["month", "customerId"]).count()[["id"]].rename(columns={"id": "count"})
The result is:
count
month customerId
2000-01-01 1 2
2 0
2000-02-01 1 1
2 0
2000-03-01 1 1
2 0
2000-04-01 1 2
2 0
2000-05-01 1 2
2 1
2000-06-01 1 2
2 2
2000-07-01 1 1
2 1
Note:
For unset dates, replace the end date with the very last date you're interested in before you start the calculation.
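A minimal sketch of that replacement (the unset_date sentinel comes from the question; the cut-off value is an assumption for this example, set one month past the last period of interest because the month range above excludes the end month):
unset_date = pd.to_datetime("1900-01")
# treat unset end dates as running through the last period of interest (2000-07)
df["endDate"] = df["endDate"].replace(unset_date, pd.Timestamp("2000-08-01"))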
You can do it with two pivot_table calls to get the count of id per customer, with customers in columns and the start date (respectively the end date) in the index. Reindex each one with the period_range you are interested in. Subtract the end pivot from the start pivot, then use cumsum to get the cumulative sum of active ids per customerId. Finally, use stack and reset_index to bring it to the wanted shape.
#convert to period columns like period_date
df['startDate'] = pd.to_datetime(df['startDate']).dt.to_period('M')
df['endDate'] = pd.to_datetime(df['endDate']).dt.to_period('M')
#create the pivots
pvs = (df.pivot_table(index='startDate', columns='customerId', values='id',
aggfunc='count', fill_value=0)
.reindex(period_range, fill_value=0)
)
pve = (df.pivot_table(index='endDate', columns='customerId', values='id',
aggfunc='count', fill_value=0)
.reindex(period_range, fill_value=0)
)
print (pvs)
customerId 1 2
2000-01 2 0 #two id for customer 1 that start at this month
2000-02 0 0
2000-03 0 0
2000-04 1 0
2000-05 0 1 #one id for customer 2 that start at this month
2000-06 0 1
2000-07 0 0
Now you can substract one to the other and use cumsum to get the wanted amount per date.
res = (pvs - pve).cumsum().stack().reset_index()
res.columns = ['date', 'customerId','customerCount']
print (res)
date customerId customerCount
0 2000-01 1 2
1 2000-01 2 0
2 2000-02 1 1
3 2000-02 2 0
4 2000-03 1 1
5 2000-03 2 0
6 2000-04 1 2
7 2000-04 2 0
8 2000-05 1 2
9 2000-05 2 1
10 2000-06 1 2
11 2000-06 2 2
12 2000-07 1 1
13 2000-07 2 1
Not really sure how to handle the unset_date, as I don't see what it is used for.
I have a dataframe that looks like this:
import pandas as pd
date = ['28-01-2017','29-01-2017','30-01-2017','31-01-2017','01-02-2017','02-02-2017','...']
sales = [1,2,3,4,1,2,'...']
days_left_in_m = [3,2,1,0,29,28,'...']
df_test = pd.DataFrame({'date': date,'days_left_in_m':days_left_in_m,'sales':sales})
df_test
I am trying to find sales for the rest of the month.
So, for 28th of Jan 2017 it will calculate sum of the next 3 days,
for 29th of Jan - sum of the next 2 days and so on...
The outcome should look like the "required" column below.
date days_left_in_m sales required
0 28-01-2017 3 1 10
1 29-01-2017 2 2 9
2 30-01-2017 1 3 7
3 31-01-2017 0 4 4
4 01-02-2017 29 1 3
5 02-02-2017 28 2 2
6 ... ... ... ...
My current solution is really ugly: I use non-pythonic looping:
for i in range(lenght_of_t_series):
    days_left = data_in.loc[i].days_left_in_m
    if days_left == 0:
        sales_temp_list.append(0)
    else:
        if (i + days_left) <= lenght_of_t_series:
            sales_temp_list.append(sum(data_in.loc[(i + 1):(i + days_left)].sales))
        else:
            sales_temp_list.append(np.nan)
I guess a much better way of doing this would be to use df['sales'].rolling(n).sum(); however, each row has a different window.
Please advise on the best way of doing this...
I think you need DataFrame.sort_values with GroupBy.cumsum.
If you do not want to take into account the current day, you can use groupby.shift (see the commented code).
First, convert the date column to datetime in order to use Series.dt.month:
df_test['date'] = pd.to_datetime(df_test['date'],format = '%d-%m-%Y')
Then we can use:
months = df_test['date'].dt.month
df_test['required'] = (df_test.sort_values('date',ascending = False)
.groupby(months)['sales'].cumsum()
#.groupby(months).shift(fill_value = 0)
)
print(df_test)
Output
date days_left_in_m sales required
0 2017-01-28 3 1 10
1 2017-01-29 2 2 9
2 2017-01-30 1 3 7
3 2017-01-31 0 4 4
4 2017-02-01 29 1 3
5 2017-02-02 28 2 2
If you don't want convert date column to datetime use:
months = pd.to_datetime(df_test['date'],format = '%d-%m-%Y').dt.month
df_test['required'] = (df_test.sort_values('date',ascending = False)
.groupby(months)['sales'].cumsum()
#.groupby(months).shift(fill_value = 0)
)
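For reference, the commented shift line written out in full; this matches the "sum of the next N days" reading, which excludes the current day's sales (the new column name is just for illustration):
months = pd.to_datetime(df_test['date'], format='%d-%m-%Y').dt.month
df_test['required_excl_today'] = (df_test.sort_values('date', ascending=False)
                                         .groupby(months)['sales'].cumsum()
                                         .groupby(months).shift(fill_value=0)
                                  )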
I'm using pandas to do some calculations, and I need to sum some values, like count1 and count2, based on dates, weekly or daily and so on.
my df =
id count1 ... count2 date
0 1 1 ... 52 2019-12-09
1 2 1 ... 23 2019-12-10
2 3 1 ... 0 2019-12-11
3 4 1 ... 17 2019-12-18
4 5 1 ... 20 2019-12-20
5 6 1 ... 4 2019-12-21
6 7 1 ... 2 2019-12-21
How can I group by date with a weekly frequency?
I tried many ways but got different errors.
Many thanks.
Use DataFrame.resample by W with sum:
#convert date column to datetimes
df['date'] = pd.to_datetime(df['date'])
df1 = df.resample('W', on='date')[['count1', 'count2']].sum()
Or use Grouper:
df1 = df.groupby(pd.Grouper(freq='W', key='date'))[['count1', 'count2']].sum()
print (df1)
count1 count2
date
2019-12-15 3 75
2019-12-22 4 43
First, generate a week column so you can group by it:
df['week_id'] = df['date'].dt.isocalendar().week
Then group the dataframe and iterate through each group and do your stuff:
grouped_df = df.groupby('week_id')
for index, sub_df in grouped_df:
#use sub_df, it is data for each week
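If the goal is just weekly totals, a minimal sketch of getting them straight from the same grouped object, with no loop at all:
weekly_totals = grouped_df[['count1', 'count2']].sum()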
I have a relatively big data frame (~10 million rows). It has an id and a DateTimeIndex. For each row, I have to count the number of entries with the same id within a period of time before it (last week/month/year). I've created my own function using relativedelta and storing the dates in a separate dictionary {id: [dates]}, but it is extremely slow. How can I do this fast and properly?
P.S.: I've heard about pandas.rolling() but I can't figure out how to use it correctly.
P.P.S.: my function:
def isinrange(date, listdate, delta):
    date, listdate = datetime.datetime.strptime(date, format), datetime.datetime.strptime(listdate, format)
    return date - delta <= listdate
The main code (it contains tons of unnecessary operations):
dictionary = dict()  # structure {id: [dates]}
for row in df.itertuples():  # filling the dictionary
    if row.id in dictionary:
        dictionary[row.id].append(row.DateTimeIndex)
    else:
        dictionary[row.id] = [row.DateTimeIndex, ]
week, month, year = relativedelta(days=7), relativedelta(months=1), relativedelta(years=1)  # relativedelta init
for row, i in zip(df.itertuples(), range(df.shape[0])):  # iterating over the dataframe
    cnt1 = cnt2 = cnt3 = 0  # weekly, monthly, yearly - for each row
    for date in dictionary[row.id]:  # for each date with the id from the row
        index_date = row.DateTimeIndex
        if date <= index_date:  # if the date from the dictionary is earlier than the row's date
            if isinrange(index_date, date, year):
                cnt1 += 1
            if isinrange(index_date, date, month):
                cnt2 += 1
            if isinrange(index_date, date, week):
                cnt3 += 1
    df.loc[[i, 36], 'Weekly'] = cnt1  # add values to the data frame
    df.loc[[i, 37], 'Monthly'] = cnt2
    df.loc[[i, 38], 'Yearly'] = cnt3
Sample:
id date
1 2015-05-19
1 2015-05-22
2 2018-02-21
2 2018-02-23
2 2018-02-27
Expected result:
id date last_week
1 2015-05-19 0
1 2015-05-22 1
2 2018-02-21 0
2 2018-02-23 1
2 2018-02-27 2
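For reference, a minimal sketch of the pandas.rolling() approach mentioned in the question, applied to the sample above; it assumes the frame is sorted by id and date, and the helper ones column and the 7D window are illustrative choices:
import pandas as pd

df = pd.DataFrame({
    'id': [1, 1, 2, 2, 2],
    'date': pd.to_datetime(['2015-05-19', '2015-05-22', '2018-02-21', '2018-02-23', '2018-02-27']),
})
df = df.sort_values(['id', 'date']).reset_index(drop=True)
df['ones'] = 1
# time-based rolling count of rows per id within the last 7 days; subtract 1 to exclude the current row
df['last_week'] = (df.groupby('id')
                     .rolling('7D', on='date')['ones']
                     .count()
                     .to_numpy()
                     .astype(int) - 1)
df = df.drop(columns='ones')
# last_week comes out as 0, 1, 0, 1, 2, matching the expected result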
year_range = ["2018"]
month_range = ["06"]
day_range = [str(x) for x in range(18, 25)]
date_range = [year_range, month_range, day_range]

# df = your dataframe
your_result = (
    df[df.date.apply(lambda x: sum([x.split("-")[i] in date_range[i] for i in range(3)]) == 3)]
    .groupby("id")
    .size()
    .reset_index(name="counts")
)
print(your_result[:5])
I'm not sure I understood correctly, but is this something like what you're looking for?
It took ~15 s on a 10-million-row "test" dataframe:
id counts
0 0 454063
1 1 454956
2 2 454746
3 3 455317
4 4 454312
Wall time: 14.5 s
The "test" dataframe:
id date
0 4 2018-06-06
1 2 2018-06-18
2 4 2018-06-06
3 3 2018-06-18
4 5 2018-06-06
import pandas as pd
src = "path/data.csv"
df = pd.read_csv(src, sep=",")
print(df)
# id date
# 0 1 2015-05-19
# 1 1 2015-05-22
# 2 2 2018-02-21
# 3 2 2018-02-23
# 4 2 2018-02-27
# Convert date column to a datetime
df['date'] = pd.to_datetime(df['date'])
# Retrieve rows in the date range
date_ini = '2015-05-18'
date_end = '2016-05-18'
filtered_rows = df.loc[(df['date'] > date_ini) & (df['date'] <= date_end)]
print(filtered_rows)
# id date
# 0 1 2015-05-19
# 1 1 2015-05-22
# Group rows by id
grouped_by_id = filtered_rows.groupby(['id']).agg(['count'])
print(grouped_by_id)
# count
# id
# 1 2
I am new to Python. I am trying to write code that loops through the data in a pandas dataframe. Below is my initial data:
A B C Start Date End Date
1 2 5 01/01/15 1/31/15
1 2 4 02/01/15 2/28/15
1 2 7 02/25/15 3/15/15
1 2 9 03/11/15 3/30/15
1 2 8 03/14/15 4/5/15
1 2 3 03/31/15 4/10/15
1 2 4 04/05/15 4/27/15
1 2 11 04/15/15 4/20/15
4 5 23 5/6/16 6/6/16
4 5 12 6/10/16 7/10/16
I want to create a new column called Forward_C. Forward_C is the value of C from the row that satisfies these conditions:
Columns A and B should be equal to those of the current row.
Its Start Date should be greater than both the Start Date and the End Date of the current row.
The expected output is :
A B C Start Date End Date Forward_C
1 2 5 01/01/15 1/31/15 4
1 2 4 02/01/15 2/28/15 9
1 2 7 02/25/15 3/15/15 3
1 2 9 03/11/15 3/30/15 3
1 2 8 03/14/15 4/5/15 11
1 2 3 03/31/15 4/10/15 11
1 2 4 04/05/15 4/27/15 0
1 2 11 04/15/15 4/20/15 0
4 5 23 5/6/16 6/6/16 12
4 5 12 6/10/16 7/10/16 0
I wrote below code to achieve the same:
df = data.groupby(['A','B'], as_index = False).apply(lambda x:
x.sort_values(['Start Date','End Date'],ascending = True))
for i,j in df.iterrows():
for index,row in df.iterrows():
if (j['A'] == row['A']) and (j['B'] == row['B']) and (row['Start Date'] > j['End Date']) and (j['Start Date'] < row['Start Date']):
j['Forward_C'] = row['C']
df.loc[i,'Forward_C'] = row['C']
break
I was wondering if there is a more efficient way to do this in Python, because my code iterates through all the rows for each record. This slows performance badly, since I will be dealing with more than 10 million records.
Your input is appreciated.
Thanks in advance.
Regards,
RD
I was not exactly clear on the question; based on my understanding, this is what I could come up with. I am using a cross join instead of a loop.
import pandas

data = #Actual Data Frame
data['Join'] = "CrossJoinColumn"
df1 = pandas.merge(data, data, how="left", on="Join", suffixes=["", "_2"])
df1 = (df1[(df1['A'] == df1['A_2']) & (df1['B'] == df1['B_2']) &
           (df1['Start Date'] < df1['Start Date_2']) & (df1['End Date'] < df1['Start Date_2'])]
       .groupby(by=['A', 'B', 'C', 'Start Date', 'End Date'])
       .first()
       .reset_index()[['A', 'B', 'C', 'Start Date', 'End Date', 'C_2']])
df1 = pandas.merge(data, df1, how="left", on=['A', 'B', 'C', 'Start Date', 'End Date'])
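If the cross join becomes too memory-hungry at 10+ million rows, a possible alternative is a sketch based on pd.merge_asof (a different technique than the answer above, and an assumption on my part): for each row it picks, within the same A/B group, the first row whose Start Date is strictly after the current End Date. This matches the expected Forward_C as long as End Date >= Start Date holds on every row, as in the sample.
import pandas as pd

data['Start Date'] = pd.to_datetime(data['Start Date'])
data['End Date'] = pd.to_datetime(data['End Date'])

left = data.sort_values('End Date')
right = (data[['A', 'B', 'C', 'Start Date']]
         .rename(columns={'C': 'Forward_C', 'Start Date': 'next_start'})
         .sort_values('next_start'))

out = pd.merge_asof(left, right,
                    left_on='End Date', right_on='next_start',
                    by=['A', 'B'],
                    direction='forward',         # nearest later start date
                    allow_exact_matches=False)   # strictly greater than End Date
out['Forward_C'] = out['Forward_C'].fillna(0).astype(int)
out = out.drop(columns='next_start').sort_values(['A', 'B', 'Start Date'])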