I have a dataframe that looks like this:
import pandas as pd
date = ['28-01-2017','29-01-2017','30-01-2017','31-01-2017','01-02-2017','02-02-2017','...']
sales = [1,2,3,4,1,2,'...']
days_left_in_m = [3,2,1,0,29,28,'...']
df_test = pd.DataFrame({'date': date,'days_left_in_m':days_left_in_m,'sales':sales})
df_test
I am trying to find sales for the rest of the month.
So, for 28th of Jan 2017 it will calculate sum of the next 3 days,
for 29th of Jan - sum of the next 2 days and so on...
The outcome should look like the "required" column below.
date days_left_in_m sales required
0 28-01-2017 3 1 10
1 29-01-2017 2 2 9
2 30-01-2017 1 3 7
3 31-01-2017 0 4 4
4 01-02-2017 29 1 3
5 02-02-2017 28 2 2
6 ... ... ... ...
My current solution is really ugly, a non-Pythonic loop:
for i in range(lenght_of_t_series):
    days_left = data_in.loc[i].days_left_in_m
    if days_left == 0:
        sales_temp_list.append(0)
    else:
        if (i+days_left) <= lenght_of_t_series:
            sales_temp_list.append(sum(data_in.loc[(i+1):(i+days_left)].sales))
        else:
            sales_temp_list.append(np.nan)
I guess a much better way of doing this would be to use df['sales'].rolling(n).sum().
However, each row needs a different window size.
Please advise on the best way of doing this...
I think you need DataFrame.sort_values with GroupBy.cumsum: sort the dates in descending order and take the cumulative sum of sales within each month.
If you do not want to count the current day, you can
add groupby.shift (see the commented-out line).
First, convert the date column to datetime so that Series.dt.month can be used:
df_test['date'] = pd.to_datetime(df_test['date'],format = '%d-%m-%Y')
Then we can use:
months = df_test['date'].dt.month
df_test['required'] = (df_test.sort_values('date', ascending=False)
                              .groupby(months)['sales'].cumsum()
                              # .groupby(months).shift(fill_value=0)
                       )
print(df_test)
Output
date days_left_in_m sales required
0 2017-01-28 3 1 10
1 2017-01-29 2 2 9
2 2017-01-30 1 3 7
3 2017-01-31 0 4 4
4 2017-02-01 29 1 3
5 2017-02-02 28 2 2
If you don't want to convert the date column to datetime, use:
months = pd.to_datetime(df_test['date'], format='%d-%m-%Y').dt.month
df_test['required'] = (df_test.sort_values('date', ascending=False)
                              .groupby(months)['sales'].cumsum()
                              # .groupby(months).shift(fill_value=0)
                       )
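One caveat with grouping on dt.month alone: January 2017 would be pooled with January of any other year. Below is a minimal sketch, assuming the data may span several years, that groups on a year-month period instead. The column names month and required_excl_today are my own and do not appear in the question.

```python
import pandas as pd

df_test = pd.DataFrame({
    "date": ["28-01-2017", "29-01-2017", "30-01-2017",
             "31-01-2017", "01-02-2017", "02-02-2017"],
    "sales": [1, 2, 3, 4, 1, 2],
})
df_test["date"] = pd.to_datetime(df_test["date"], format="%d-%m-%Y")

# Year-month period, so the same month in different years stays separate
# (hypothetical helper column).
df_test["month"] = df_test["date"].dt.to_period("M")

# Walk each month backwards and accumulate: every row gets its own sales
# plus all later sales in the same month. Assignment aligns on the index.
ordered = df_test.sort_values("date", ascending=False)
df_test["required"] = ordered.groupby("month")["sales"].cumsum()

# Variant that excludes the current day.
df_test["required_excl_today"] = df_test["required"] - df_test["sales"]

print(df_test)
```

Subtracting the row's own sales gives the same numbers as the commented-out shift, without a second groupby.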
Related
I have this data frame:
id date count
1 8/31/22 1
1 9/1/22 2
1 9/2/22 8
1 9/3/22 0
1 9/4/22 3
1 9/5/22 5
1 9/6/22 1
1 9/7/22 6
1 9/8/22 5
1 9/9/22 7
1 9/10/22 1
2 8/31/22 0
2 9/1/22 2
2 9/2/22 0
2 9/3/22 5
2 9/4/22 1
2 9/5/22 6
2 9/6/22 1
2 9/7/22 1
2 9/8/22 2
2 9/9/22 2
2 9/10/22 0
I want to aggregate the count by id and date to get the sum of quantities. Details:
Date: all counts in a week should be aggregated onto that week's Saturday. A week starts on Sunday and ends on Saturday. The time period (the first and the last day of counts) is the same for all ids.
The desired output is given below:
id date count
1 9/3/22 11
1 9/10/22 28
2 9/3/22 7
2 9/10/22 13
I already have the following code, and it works, but it takes a long time to run on a large dataset. I am looking for a much faster and more efficient way to get the output:
new_df['day_name'] = new_df['date'].dt.day_name()
df_week_count = pd.DataFrame(columns=['id', 'date', 'count'])
for id in ids:
    # make a dataframe for each id
    df_id = new_df.loc[new_df['id'] == id]
    df_id.reset_index(drop=True, inplace=True)
    # find the indices of the Saturdays
    saturday_indices = df_id.loc[df_id['day_name'] == 'Saturday'].index
    j = 0
    sat_index = 0
    while(j < len(df_id)):
        # find sum of count between j and saturday_indices[sat_index]
        sum_count = df_id.loc[j:saturday_indices[sat_index], 'count'].sum()
        # add id, date, sum_count to df_week_count
        temp_df = pd.DataFrame([[id, df_id.loc[saturday_indices[sat_index], 'date'], sum_count]], columns=['id', 'date', 'count'])
        df_week_count = pd.concat([df_week_count, temp_df], ignore_index=True)
        j = saturday_indices[sat_index] + 1
        sat_index += 1
        if sat_index >= len(saturday_indices):
            break
    # rows after the last Saturday are summed onto the last available date
    if(j < len(df_id)):
        sum_count = df_id.loc[j:, 'count'].sum()
        temp_df = pd.DataFrame([[id, df_id.loc[len(df_id) - 1, 'date'], sum_count]], columns=['id', 'date', 'count'])
        df_week_count = pd.concat([df_week_count, temp_df], ignore_index=True)
df_final = df_week_count.copy(deep=True)
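Create a grouping factor from the dates ('%U' is the week number of the year, with Sunday as the first day of the week):
week = pd.to_datetime(df['date'].to_numpy()).strftime('%U %y')
df.groupby(['id', week]).agg({'date': 'max', 'count': 'sum'}).reset_index()
   id level_1    date  count
0   1   35 22  9/3/22     11
1   1   36 22  9/9/22     28
2   2   35 22  9/3/22      7
3   2   36 22  9/9/22     13
Note that date is still a string column here, so max compares text and returns 9/9/22 rather than 9/10/22 for the second week. Convert the column with pd.to_datetime first to get the true last date.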
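I tried to understand as much as I could :)
Here is my process:
from io import StringIO
import pandas as pd
# reading data (`data` holds the table from the question as a string)
df = pd.read_csv(StringIO(data), sep=' ')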
# data type fix
df['date'] = pd.to_datetime(df['date'])
# initial grouping
df = df.groupby(['id', 'date'])['count'].sum().to_frame().reset_index()
df.sort_values(by=['date', 'id'], inplace=True)
df.reset_index(drop=True, inplace=True)
# getting name of the day
df['day_name'] = df.date.dt.day_name()
# getting week number
df['week'] = df.date.dt.isocalendar().week
# adjusting week number to make saturday the last day of the week
df.loc[df.day_name == 'Sunday','week'] = df.loc[df.day_name == 'Sunday', 'week'] + 1
What I think you are looking for:
df.groupby(['id','week']).agg(count=('count','sum'), date=('date','max')).reset_index()
   id  week  count                 date
0   1    35     11  2022-09-03 00:00:00
1   1    36     28  2022-09-10 00:00:00
2   2    35      7  2022-09-03 00:00:00
3   2    36     13  2022-09-10 00:00:00
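As an alternative to computing week numbers by hand, pandas has an anchored weekly frequency, 'W-SAT', whose bins end on Saturday. The following is only a sketch under the assumption that labelling each week by its Saturday is acceptable; the variable name weekly is mine.

```python
import pandas as pd

dates = ["8/31/22", "9/1/22", "9/2/22", "9/3/22", "9/4/22", "9/5/22",
         "9/6/22", "9/7/22", "9/8/22", "9/9/22", "9/10/22"]
df = pd.DataFrame({
    "id": [1] * 11 + [2] * 11,
    "date": dates * 2,
    "count": [1, 2, 8, 0, 3, 5, 1, 6, 5, 7, 1,
              0, 2, 0, 5, 1, 6, 1, 1, 2, 2, 0],
})
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%y")

# 'W-SAT' bins run Sunday..Saturday and are labelled with the Saturday.
weekly = (df.groupby(["id", pd.Grouper(key="date", freq="W-SAT")])["count"]
            .sum()
            .reset_index())
print(weekly)
```

One difference from the loop in the question: a trailing partial week is labelled with the upcoming Saturday, not with the last date present in the data.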
I would like to add a column to my dataset that is derived from the time stamp and counts the days as steps. That is, for one year there should be 365 "steps": all payments for each account on day 1 should be labeled 1 in this column, all payments on day 2 should be labeled 2, and so on up to day 365. I would like it to look something like this:
account time steps
0 A 2022.01.01 1
1 A 2022.01.02 2
2 A 2022.01.02 2
3 B 2022.01.01 1
4 B 2022.01.03 3
5 B 2022.01.05 5
I have tried this:
def day_step(x):
    x['steps'] = x.time.dt.day.shift()
    return x
df = df.groupby('account').apply(day_step)
However, it only counts within each month: once a new month begins, it starts again from 1.
How can I fix this so that it provides the step count for the entire year?
Use GroupBy.transform with 'first' (or 'min') to get each account's first date, subtract it from the time column, convert the timedeltas to days and add 1:
df['time'] = pd.to_datetime(df['time'])
df['steps1'] = (df['time'].sub(df.groupby('account')['time'].transform('first'))
                          .dt.days
                          .add(1))
print (df)
account time steps steps1
0 A 2022-01-01 1 1
1 A 2022-01-02 2 2
2 A 2022-01-02 2 2
3 B 2022-01-01 1 1
4 B 2022-01-03 3 3
5 B 2022-01-05 5 5
First idea, which works only if counting should start on January 1 and all dates fall in the same year:
df['steps'] = df['time'].dt.dayofyear
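To see that the transform-based version keeps counting across month boundaries, here is a small sketch on made-up dates (the sample values are mine, not from the question). It uses 'min' instead of 'first', which does not depend on the rows being sorted.

```python
import pandas as pd

df = pd.DataFrame({
    "account": ["A", "A", "A", "B", "B"],
    "time": ["2022.01.31", "2022.02.01", "2022.02.01",
             "2022.01.01", "2022.03.01"],
})
df["time"] = pd.to_datetime(df["time"], format="%Y.%m.%d")

# Earliest date per account, broadcast back to every row of that account.
first_day = df.groupby("account")["time"].transform("min")

# Whole days elapsed since the account's first day, counted from 1.
df["steps"] = (df["time"] - first_day).dt.days + 1
print(df)
```

Account A goes from 1 on January 31 to 2 on February 1, so the counter no longer resets when the month changes.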
I am fairly new to working with pandas. I have a dataframe with individual entries like this:
dfImport:
id  date_created  date_closed
0   01-07-2020
1   02-09-2020    10-09-2020
2   07-03-2019    02-09-2020
I would like to summarize it so that I get the total number of created and closed objects (counting ids), grouped by year, quarter and month, like this:
dfInOut:
Year  Qrt  month      number_created  number_closed
2019  1    March      1               0
2020  3    July       1               0
           September  1               2
I guess I'd have to use some combination of crosstab or groupby. I have tried out a lot of ideas and researched the problem, but I can't seem to figure out a way. I guess it's an issue of understanding. Thanks in advance!
Use DataFrame.melt with crosstab:
df['date_created'] = pd.to_datetime(df['date_created'], dayfirst=True)
df['date_closed'] = pd.to_datetime(df['date_closed'], dayfirst=True)
df1 = df.melt(value_vars=['date_created','date_closed']).dropna()
df = (pd.crosstab([df1['value'].dt.year.rename('Year'),
                   df1['value'].dt.quarter.rename('Qrt'),
                   df1['value'].dt.month.rename('Month')], df1['variable'])
        [['date_created','date_closed']])
print (df)
variable         date_created  date_closed
Year Qrt Month
2019 1   3                  1            0
2020 3   7                  1            0
         9                  1            2
df = df.rename_axis(None, axis=1).reset_index()
print (df)
Year Qrt Month date_created date_closed
0 2019 1 3 1 0
1 2020 3 7 1 0
2 2020 3 9 1 2
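The question asked for month names and for columns called number_created and number_closed. A self-contained sketch of the same melt and crosstab approach that adds those two touches is below; using the standard-library calendar module for the names is my choice, and the variable names long_df and out are hypothetical.

```python
import calendar
import pandas as pd

df = pd.DataFrame({
    "id": [0, 1, 2],
    "date_created": ["01-07-2020", "02-09-2020", "07-03-2019"],
    "date_closed": [None, "10-09-2020", "02-09-2020"],
})
for col in ["date_created", "date_closed"]:
    df[col] = pd.to_datetime(df[col], format="%d-%m-%Y")

# Long format: one row per (id, event type, date); open items drop out.
long_df = df.melt(id_vars="id",
                  value_vars=["date_created", "date_closed"]).dropna()

out = (pd.crosstab([long_df["value"].dt.year.rename("Year"),
                    long_df["value"].dt.quarter.rename("Qrt"),
                    long_df["value"].dt.month.rename("Month")],
                   long_df["variable"])
         .rename(columns={"date_created": "number_created",
                          "date_closed": "number_closed"})
         [["number_created", "number_closed"]]
         .rename_axis(None, axis=1)
         .reset_index())

# Replace month numbers with names only after counting, so rows stay
# in calendar order instead of alphabetical order.
out["Month"] = out["Month"].map(lambda m: calendar.month_name[m])
print(out)
```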
Below is one column from a Data Frame:
Year
1981
1988
-1
1921
2000
-1
I need to convert these years (except for rows containing -1) into the number of days between that year and 2020.
I have written:
df['Year'] = 2020 - df['Year']*365
How do I convert these years into days while skipping the -1 values? The -1 entries need to stay as they are; only the real years should be converted.
Did you want something like this?
df.loc[df['Year'] != -1, 'Year'] = (2020-df.loc[df['Year'] != -1, 'Year'])*365
Output:
Year
0 14235
1 11680
2 -1
3 36135
4 7300
5 -1
Another way:
df = df.assign(Year=((2020-df.loc[df['Year'] != -1, 'Year'])*365)).fillna(-1)
df
Output:
Year
0 14235.0
1 11680.0
2 -1.0
3 36135.0
4 7300.0
5 -1.0
Third way:
df['Year'] = ((2020 - df['Year'])*365).mask(df['Year']==-1, -1)
Output:
Year
0 14235
1 11680
2 -1
3 36135
4 7300
5 -1
Here you go:
import numpy as np
df['Year'] = np.where(df.Year != -1, (2020 - df.Year) * 365, -1)
print(df)
Output
Year
0 14235
1 11680
2 -1
3 36135
4 7300
5 -1
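For completeness, a short sketch of why the attempt in the question gives wrong numbers even for valid years: multiplication binds tighter than subtraction, so the difference has to be parenthesised. The names wrong, is_year and days are mine; Series.where is just one more way to keep the -1 rows.

```python
import pandas as pd

df = pd.DataFrame({"Year": [1981, 1988, -1, 1921, 2000, -1]})

# The attempt from the question: evaluated as 2020 - (Year * 365).
wrong = 2020 - df["Year"] * 365

# Parenthesise the difference, then put -1 back wherever the original was -1.
is_year = df["Year"] != -1
days = ((2020 - df["Year"]) * 365).where(is_year, -1)

print(days.tolist())
```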
I am new to Python. I am trying to write code that loops through the rows of a pandas dataframe. Below is my initial data:
A  B  C   Start Date  End Date
1  2  5   01/01/15    1/31/15
1  2  4   02/01/15    2/28/15
1  2  7   02/25/15    3/15/15
1  2  9   03/11/15    3/30/15
1  2  8   03/14/15    4/5/15
1  2  3   03/31/15    4/10/15
1  2  4   04/05/15    4/27/15
1  2  11  04/15/15    4/20/15
4  5  23  5/6/16      6/6/16
4  5  12  6/10/16     7/10/16
I want to create a new column, Forward_C. For each row, Forward_C is the C value of the first other row that satisfies these conditions:
Columns A and B are equal to those of the current row.
Its Start Date is greater than both the Start Date and the End Date of the current row.
The expected output is :
A  B  C   Start Date  End Date  Forward_C
1  2  5   01/01/15    1/31/15   4
1  2  4   02/01/15    2/28/15   9
1  2  7   02/25/15    3/15/15   3
1  2  9   03/11/15    3/30/15   3
1  2  8   03/14/15    4/5/15    11
1  2  3   03/31/15    4/10/15   11
1  2  4   04/05/15    4/27/15   0
1  2  11  04/15/15    4/20/15   0
4  5  23  5/6/16      6/6/16    12
4  5  12  6/10/16     7/10/16   0
I wrote below code to achieve the same:
df = data.groupby(['A','B'], as_index = False).apply(lambda x:
    x.sort_values(['Start Date','End Date'],ascending = True))
for i,j in df.iterrows():
    for index,row in df.iterrows():
        if (j['A'] == row['A']) and (j['B'] == row['B']) and (row['Start Date'] > j['End Date']) and (j['Start Date'] < row['Start Date']):
            j['Forward_C'] = row['C']
            df.loc[i,'Forward_C'] = row['C']
            break
I was wondering if there is a more efficient way to do the same in Python.
Right now my code iterates through all the rows for each record, which will be far too slow, since it will be dealing with more than 10 million records.
Your input is appreciated.
Thanks in advance.
Regards,
RD
The question was not entirely clear to me, but based on my understanding, this is what I could come up with. I am using a cross join instead of a loop.
import pandas

# `data` is your actual DataFrame
data['Join'] = "CrossJoinColumn"
# joining on a constant key pairs every row with every other row
df1 = pandas.merge(data, data, how="left", on="Join", suffixes=("", "_2"))
df1 = (df1[(df1['A'] == df1['A_2'])
           & (df1['B'] == df1['B_2'])
           & (df1['Start Date'] < df1['Start Date_2'])
           & (df1['End Date'] < df1['Start Date_2'])]
       .groupby(by=['A', 'B', 'C', 'Start Date', 'End Date'])
       .first()
       .reset_index()[['A', 'B', 'C', 'Start Date', 'End Date', 'C_2']])
df1 = pandas.merge(data, df1, how="left", on=['A', 'B', 'C', 'Start Date', 'End Date'])
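A cross join builds every pair of rows, which may not fit in memory at 10 million records. A possible alternative is pandas.merge_asof with direction="forward", sketched below. It assumes the date columns are parsed as datetimes and that End Date is never earlier than Start Date, so "Start Date later than the current End Date" already implies "later than the current Start Date". The names left, right, matched, result and next_start are hypothetical.

```python
import pandas as pd

data = pd.DataFrame({
    "A": [1] * 8 + [4] * 2,
    "B": [2] * 8 + [5] * 2,
    "C": [5, 4, 7, 9, 8, 3, 4, 11, 23, 12],
    "Start Date": ["01/01/15", "02/01/15", "02/25/15", "03/11/15", "03/14/15",
                   "03/31/15", "04/05/15", "04/15/15", "5/6/16", "6/10/16"],
    "End Date": ["1/31/15", "2/28/15", "3/15/15", "3/30/15", "4/5/15",
                 "4/10/15", "4/27/15", "4/20/15", "6/6/16", "7/10/16"],
})
for col in ["Start Date", "End Date"]:
    data[col] = pd.to_datetime(data[col], format="%m/%d/%y")

# merge_asof needs both sides sorted on their join keys.
left = data.reset_index().sort_values("End Date")
right = (data[["A", "B", "Start Date", "C"]]
         .rename(columns={"Start Date": "next_start", "C": "Forward_C"})
         .sort_values("next_start"))

# For each row, take the nearest row in the same (A, B) group whose
# start date is strictly after this row's end date.
matched = pd.merge_asof(left, right,
                        left_on="End Date", right_on="next_start",
                        by=["A", "B"], direction="forward",
                        allow_exact_matches=False)

# Restore the original row order and use 0 where nothing matched.
result = (matched.sort_values("index").set_index("index")
                 .drop(columns="next_start"))
result["Forward_C"] = result["Forward_C"].fillna(0).astype(int)
print(result)
```

On the sample data this reproduces the expected Forward_C column. I have not measured it against 10 million rows, so treat the performance benefit as something to verify on your own data.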