Sorry, I'm new to Python.
I have a dataframe of entities that log values once a month. For each unique entity in my dataframe, I locate the max value and then the month that max value falls in. Using the max-value month, a time delta in days can be calculated between it and each of that entity's months. This works for small dataframes.
I know my loop is not performant and can't scale to larger dataframes (e.g., 3M rows, 156+ MB). After weeks of research I've gathered that my loop is the bottleneck, and I feel there is a NumPy solution or something more pythonic. Can someone see a more performant approach to calculating this time delta in days?
I've tried various value.shift(x) calculations in a lambda function, but the peak value is not at a consistent offset. I've also tried precomputing extra columns to minimize the calculations inside my loop.
import pandas as pd
df = pd.DataFrame({'entity':['A','A','A','A','B','B','B','C','C','C','C','C'], 'month': ['10/31/2018','11/30/2018','12/31/2018','1/31/2019','1/31/2009','2/28/2009','3/31/2009','8/31/2011','9/30/2011','10/31/2011','11/30/2011','12/31/2011'], 'value':['80','600','500','400','150','300','100','200','250','300','200','175'], 'month_number': ['1','2','3','4','1','2','3','1','2','3','4','5']})
df['month'] = df['month'].apply(pd.to_datetime)
for entity in set(df['entity']):
    # set peak value
    peak_value = df.loc[df['entity'] == entity, 'value'].max()
    # set peak value date
    peak_date = df.loc[(df['entity'] == entity) & (df['value'] == peak_value), 'month'].min()
    # subtract peak date from current date
    delta = df.loc[df['entity'] == entity, 'month'] - peak_date
    # update days_delta with delta in days
    df.loc[df['entity'] == entity, 'days_delta'] = delta
RESULT:
entity month value month_number days_delta
A 2018-10-31 80 1 0 days
A 2018-11-30 600 2 30 days
A 2018-12-31 500 3 61 days
A 2019-01-31 400 4 92 days
B 2009-01-31 150 1 -28 days
B 2009-02-28 300 2 0 days
B 2009-03-31 100 3 31 days
C 2011-08-31 200 1 -61 days
C 2011-09-30 250 2 -31 days
C 2011-10-31 300 3 0 days
C 2011-11-30 200 4 30 days
C 2011-12-31 175 5 61 days
Setup
First, let's also make sure value is numeric:
df = pd.DataFrame({
'entity':['A','A','A','A','B','B','B','C','C','C','C','C'],
'month': ['10/31/2018','11/30/2018','12/31/2018','1/31/2019',
'1/31/2009','2/28/2009','3/31/2009','8/31/2011',
'9/30/2011','10/31/2011','11/30/2011','12/31/2011'],
'value':['80','600','500','400','150','300','100','200','250','300','200','175'],
'month_number': ['1','2','3','4','1','2','3','1','2','3','4','5']
})
df['month'] = df['month'].apply(pd.to_datetime)
df['value'] = pd.to_numeric(df['value'])
transform and idxmax
# row label of each entity's max value, mapped to that row's month
max_months = df.groupby('entity').value.transform('idxmax').map(df.month)
df.assign(days_delta=df.month - max_months)
entity month value month_number days_delta
0 A 2018-10-31 80 1 -30 days
1 A 2018-11-30 600 2 0 days
2 A 2018-12-31 500 3 31 days
3 A 2019-01-31 400 4 62 days
4 B 2009-01-31 150 1 -28 days
5 B 2009-02-28 300 2 0 days
6 B 2009-03-31 100 3 31 days
7 C 2011-08-31 200 1 -61 days
8 C 2011-09-30 250 2 -31 days
9 C 2011-10-31 300 3 0 days
10 C 2011-11-30 200 4 30 days
11 C 2011-12-31 175 5 61 days
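If transform('idxmax') isn't available in your pandas version, an equivalent loop-free sketch (same idea, not from the original answer) is to compute each entity's peak month once with idxmax and merge it back onto every row:
# one row per entity: the month in which its max value occurs
peak = (df.loc[df.groupby('entity')['value'].idxmax(), ['entity', 'month']]
          .rename(columns={'month': 'peak_month'}))
# broadcast each entity's peak month back to all of its rows and subtract
out = df.merge(peak, on='entity')
out['days_delta'] = out['month'] - out['peak_month']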
Related
I have a large data set and I want to add values to a column based on the highest values in another column in my data set.
Easy: I can just use df.quantile() and access the appropriate values.
However, I want to check for each month in each year.
I solved it for looking at years only, see the code below.
I'm sure I could do it for months with nested for loops, but I'd rather avoid that if I can.
I guess the most pythonic way would be to not loop at all but to use pandas in a smarter way.
Any suggestions?
Sample code:
df = pd.read_excel(file)
df.index = df['date']
df = df.drop('date', axis=1)
df['new'] = 0
year = (2016, 2017, 2018, 2019, 2020)
for i in year:
    df['new'].loc[str(i)] = np.where(df['cost'].loc[str(i)] < df['cost'].loc[str(i)].quantile(0.5), 0, 1)
print(df)
Sample input
file =
cost
date
2016-11-01 30
2016-12-01 29
2017-11-01 40
2017-12-01 45
2018-11-30 240
2018-12-01 200
2019-11-30 220
2019-12-30 180
2020-11-30 150
2020-12-30 130
Output
cost new
date
2016-11-01 30 1
2016-12-01 29 0
2017-11-01 40 0
2017-12-01 45 1
2018-11-30 240 1
2018-12-01 200 0
2019-11-30 220 1
2019-12-30 180 0
2020-11-30 150 1
2020-12-30 130 0
Desired output (if quantile works like that on single values, but as an example)
cost new
date
2016-11-01 30 1
2016-12-01 29 1
2017-11-01 40 1
2017-12-01 45 1
2018-11-30 240 1
2018-12-01 200 1
2019-11-30 220 1
2019-12-30 180 1
2020-11-30 150 1
2020-12-30 130 1
Thank you _/_
An interesting question, it took me a little while to work out a solution!
import pandas as pd
df = pd.DataFrame(data={"cost": [30, 29, 40, 45, 240, 200, 220, 180, 150, 130],
"date": ["2016-11-01", "2016-12-01", "2017-11-01",
"2017-12-01", "2018-11-30", "2018-12-01",
"2019-11-30", "2019-12-30", "2020-11-30",
"2020-12-30"]})
df["date"] = pd.to_datetime(df["date"])
df.set_index("date", inplace=True)
df["new"] = df.groupby([lambda x: x.year, lambda x: x.month]).transform(lambda x: (x >= x.quantile(0.5))*1)
#Out:
# cost new
#date
#2016-11-01 30 1
#2016-12-01 29 1
#2017-11-01 40 1
#2017-12-01 45 1
#2018-11-30 240 1
#2018-12-01 200 1
#2019-11-30 220 1
#2019-12-30 180 1
#2020-11-30 150 1
#2020-12-30 130 1
What the important line does:
Groups by the index year and month
For each item in the group, calculates whether it is greater than or equal to the 0.5 quantile (as bool)
Multiplying by 1 creates an integer bool (1/0) instead of True/False
The initial creation of the dataframe should be equivalent to your df = pd.read_excel(file)
If you leave out the , lambda x: x.month part of the groupby (grouping by year only), the output is the same as your current output:
# cost new
#date
#2016-11-01 30 1
#2016-12-01 29 0
#2017-11-01 40 0
#2017-12-01 45 1
#2018-11-30 240 1
#2018-12-01 200 0
#2019-11-30 220 1
#2019-12-30 180 0
#2020-11-30 150 1
#2020-12-30 130 0
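The same grouping can also be written without lambdas, by grouping directly on the index's year and month attributes; a sketch of the same idea:
# group cost by (year, month) of the DatetimeIndex
grp = df.groupby([df.index.year, df.index.month])["cost"]
df["new"] = grp.transform(lambda x: (x >= x.quantile(0.5)).astype(int))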
I have some data with dates of sales to my clients.
The data looks like this:
   Cod client  Items        Date
0         100      1  2022/01/01
1         100      7  2022/01/01
2         100      2  2022/02/01
3         101      5  2022/01/01
4         101      8  2022/02/01
5         101     10  2022/02/01
6         101      2  2022/04/01
7         101      2  2022/04/01
8         102      4  2022/02/01
9         102     10  2022/03/01
What I'm trying to accomplish is to calculate the differences between dates for each client: grouping first by "Cod client" and then by "Date" (because of the duplicates).
The expected result is like:
   Cod client  Items        Date  Date diff  Explain
0         100      1  2022/01/01        NaT  First date for client 100
1         100      7  2022/01/01        NaT  ...repeat above
2         100      2  2022/02/01         31  Diff from first date 2022/01/01
3         101      5  2022/01/01        NaT  First date for client 101
4         101      8  2022/02/01         31  Diff from first date 2022/01/01
5         101     10  2022/02/01         31  ...repeat above
6         101      2  2022/04/01         59  Diff from previous date 2022/02/01
7         101      2  2022/04/01         59  ...repeat above
8         102      4  2022/02/01        NaT  First date for client 102
9         102     10  2022/03/01         28  Diff from first date 2022/02/01
I already tried df["Date diff"] = df.groupby("Cod client")["Date"].diff(), but it considers the repeated dates and returns zeroes for them.
I'd appreciate any help!
IIUC you can combine several groupby operations:
# ensure datetime
df['Date'] = pd.to_datetime(df['Date'])
# set up group
g = df.groupby('Cod client')
# identify duplicated dates per group
m = g['Date'].apply(pd.Series.duplicated)
# compute the diff, mask and ffill
df['Date diff'] = g['Date'].diff().mask(m).groupby(df['Cod client']).ffill()
output:
Cod client Items Date Date diff
0 100 1 2022-01-01 NaT
1 100 7 2022-01-01 NaT
2 100 2 2022-02-01 31 days
3 101 5 2022-01-01 NaT
4 101 8 2022-02-01 31 days
5 101 10 2022-02-01 31 days
6 101 2 2022-04-01 59 days
7 101 2 2022-04-01 59 days
8 102 4 2022-02-01 NaT
9 102 10 2022-03-01 28 days
Another way to do this, with transform:
import pandas as pd
# data saved as .csv
df = pd.read_csv("Data.csv", header=0, parse_dates=True)
# convert Date column to correct date.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
# new column!
df["Date diff"] = df.sort_values("Date").groupby("Cod client")["Date"].transform(lambda x: x.diff().replace("0 days", pd.NaT).ffill())
Example: by using
df['Week_Number'] = df['Date'].dt.strftime('%U')
for 29/12/2019 the week is 52, and this week runs from 29/12/2019 to 04/01/2020.
But for 01/01/2020 the week comes out as 00.
I need the week for 01/01/2020 to also be 52, and for 05/01/2020 to 11/01/2020 to be 53. This numbering needs to continue on.
I used the following logic to solve the question.
First of all, let's write a function that creates a DataFrame of dates from 2019-12-01 to 2020-01-31:
def create_date_table(start='2019-12-01', end='2020-01-31'):
    df = pd.DataFrame({"Date": pd.date_range(start, end)})
    df["Week_start_from_Monday"] = df.Date.dt.isocalendar().week
    df['Week_start_from_Sunday'] = df['Date'].dt.strftime('%U')
    return df
Run the function and observe the Dataframe
date_df=create_date_table()
date_df.head(n=40)
There are two fields in the DataFrame about weeks, Week_start_from_Monday and Week_start_from_Sunday; the difference comes from whether Monday or Sunday is counted as the first day of a week.
In this case, Week_start_from_Sunday is the one we need to focus on.
Now we write a function to add a column containing week numbers that continue from the previous year instead of resetting to 00 when we enter a new year.
from datetime import datetime
from pandas import Timestamp

def add_continued_week_field(date: Timestamp, df_start_date: str = '2019-12-01') -> int:
    start_date = datetime.strptime(df_start_date, '%Y-%m-%d')
    year_of_start_date = start_date.year
    year_of_date = date.year
    week_of_date = date.strftime("%U")
    year_diff = year_of_date - year_of_start_date
    if year_diff == 0:
        continued_week = int(week_of_date)
    else:
        # assumes 52 counted weeks per elapsed year
        continued_week = year_diff * 52 + int(week_of_date)
    return continued_week
Let's apply the function add_continued_week_field to the dates' Dataframe.
date_df['Week_continue'] = date_df['Date'].apply(add_continued_week_field)
We can now see the newly added field in the dates' DataFrame.
As stated in converting a pandas date to week number, you can use df['Date'].dt.week to get week numbers.
To let the numbering continue, you could add the previous years' week counts onto the new week values; something like this, perhaps (I cannot test this right now, and it assumes the rows are sorted by date and 53 numbered weeks per year, since %U runs 00-53):
week = df['Date'].dt.strftime('%U').astype(int)
# each time the raw week number wraps to a lower value (a new year started),
# carry the numbering forward by one full year of weeks
df['Week_Number'] = week + week.diff().lt(0).cumsum() * 53
You can do this with isoweek and isoyear.
I don't see how you arrive at the values you present with '%U' so I will assume that you want to map the week starting on Sunday 2019-12-29 ending on 2020-01-04 to 53, and that you want to map the following week to 54 and so on.
For weeks to continue past the year you need isoweek.
isocalendar() provides a tuple with isoweek in the second element and a corresponding unique isoyear in the first element.
But ISO weeks start on Monday, so we add one day so that each Sunday is interpreted as a Monday and counted into the right week.
2019 is subtracted to have years start from 0, then every year is multiplied by 53 and the ISO week is added. Finally, 1 is subtracted so you arrive at 53.
In [0]: s = pd.Series(["29/12/2019", "01/01/2020", "05/01/2020", "11/01/2020"])
   ...: dts = pd.to_datetime(s, infer_datetime_format=True)

In [0]: (dts + pd.DateOffset(days=1)).apply(lambda x: (x.isocalendar()[0] - 2019)*53 + x.isocalendar()[1] - 1)
Out[0]:
0 53
1 53
2 54
3 54
dtype: int64
This of course assumes that all iso years have 53 weeks which is not the case, so instead you would want to compute the number of iso weeks per iso year since 2019 and sum those up.
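A sketch of that refinement (iso_weeks_in_year and continued_week are hypothetical helper names): 28 December always falls in the last ISO week of its ISO year, so the real per-year week counts can be derived from it and summed as the offset.
import datetime
import pandas as pd

def iso_weeks_in_year(year: int) -> int:
    # 28 Dec is always in the last ISO week of its ISO year (52 or 53 weeks)
    return datetime.date(year, 12, 28).isocalendar()[1]

def continued_week(ts, start_isoyear=2019):
    # same one-day shift as above so Sundays count with the following ISO week
    iso_year, iso_week, _ = (ts + pd.DateOffset(days=1)).isocalendar()
    # sum the actual week counts of all ISO years between the start and this date
    offset = sum(iso_weeks_in_year(y) for y in range(start_isoyear, iso_year))
    return offset + iso_week

dts.apply(continued_week)  # gives the same 53/53/54/54 for the sample dates above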
Maybe you are looking for this. I fixed an epoch; if you have dates earlier than 2019, you can choose another epoch.
import numpy as np
import pandas as pd

epoch = pd.Timestamp("2019-12-23")
# Test data:
df = pd.DataFrame({"Date": pd.date_range("22/12/2019", freq="1D", periods=25)})
df["Day_name"] = df.Date.dt.day_name()
# Calculation: keep the native week number up to the epoch,
# then count 7-day blocks from the epoch onwards
df["Week_Number"] = np.where(df.Date.le(epoch),
                             df.Date.dt.week,
                             df.Date.sub(epoch).dt.days // 7 + 52)
df
Date Day_name Week_Number
0 2019-12-22 Sunday 51
1 2019-12-23 Monday 52
2 2019-12-24 Tuesday 52
3 2019-12-25 Wednesday 52
4 2019-12-26 Thursday 52
5 2019-12-27 Friday 52
6 2019-12-28 Saturday 52
7 2019-12-29 Sunday 52
8 2019-12-30 Monday 53
9 2019-12-31 Tuesday 53
10 2020-01-01 Wednesday 53
11 2020-01-02 Thursday 53
12 2020-01-03 Friday 53
13 2020-01-04 Saturday 53
14 2020-01-05 Sunday 53
15 2020-01-06 Monday 54
16 2020-01-07 Tuesday 54
17 2020-01-08 Wednesday 54
18 2020-01-09 Thursday 54
19 2020-01-10 Friday 54
20 2020-01-11 Saturday 54
21 2020-01-12 Sunday 54
22 2020-01-13 Monday 55
23 2020-01-14 Tuesday 55
24 2020-01-15 Wednesday 55
I got here wanting to know how to label consecutive weeks - I'm not sure if that's exactly what the question is asking but I think it might be. So here is what I came up with:
import numpy as np
import pandas as pd

# Create dataframe with example dates.
# It has a datetime index and a column with the day of week (just to check that it's working).
dates = pd.date_range('2019-12-15', '2020-01-10')
df = pd.DataFrame(dates.dayofweek, index=dates, columns=['dow'])

# Add column
# THESE ARE THE RELEVANT LINES
woy = df.index.weekofyear
numbered = np.cumsum(np.diff(woy, prepend=woy[0]) != 0)

# Append for easier comparison
df['week_num'] = numbered
df then looks like this:
dow week_num
2019-12-15 6 0
2019-12-16 0 1
2019-12-17 1 1
2019-12-18 2 1
2019-12-19 3 1
2019-12-20 4 1
2019-12-21 5 1
2019-12-22 6 1
2019-12-23 0 2
2019-12-24 1 2
2019-12-25 2 2
2019-12-26 3 2
2019-12-27 4 2
2019-12-28 5 2
2019-12-29 6 2
2019-12-30 0 3
2019-12-31 1 3
2020-01-01 2 3
2020-01-02 3 3
2020-01-03 4 3
2020-01-04 5 3
2020-01-05 6 3
2020-01-06 0 4
2020-01-07 1 4
2020-01-08 2 4
2020-01-09 3 4
2020-01-10 4 4
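The same labeling can also stay entirely in pandas; a sketch of the same idea: flag the rows where the week number changes and take the cumulative sum of the flags.
woy = pd.Series(df.index.weekofyear, index=df.index)
# True wherever the week number differs from the previous row; cumsum labels the runs
df['week_num'] = woy.ne(woy.shift()).cumsum() - 1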
I have a dataframe with columns for date, day, and week of the year.
I need to use the week of the year to create a new column with values from 1 to 5.
Let's say I'm on week 35: all the rows with week 35 should have 1, the rows with week 36 should have 2, and so on.
Once it reaches week 40 and the number 5, the numbers in the new column need to start from 1 again at week 41 and continue in this kind of pattern for however long the data range is.
def date_table(start='2019-08-26', end='2019-10-27'):
    df = pd.DataFrame({"Date": pd.date_range(start, end)})
    df["Day"] = df.Date.dt.weekday_name
    df["Week"] = df.Date.dt.weekofyear
    return df
Calculate the index using modulo on the week number:
import pandas as pd
start='2019-08-26'
end='2019-10-27'
df = pd.DataFrame({"Date": pd.date_range(start, end)})
df["Day"] = df.Date.dt.weekday_name
df["Week"] = df.Date.dt.weekofyear
df["idx"] = df["Week"] % 5 +1 # n % 5 = 0..4 plus 1 == 1..5
print(df)
Output:
0 2019-08-26 Monday 35 1
[...]
6 2019-09-01 Sunday 35 1
7 2019-09-02 Monday 36 2
[...]
13 2019-09-08 Sunday 36 2
14 2019-09-09 Monday 37 3
[...]
20 2019-09-15 Sunday 37 3
21 2019-09-16 Monday 38 4
[...]
27 2019-09-22 Sunday 38 4
28 2019-09-23 Monday 39 5
[...]
34 2019-09-29 Sunday 39 5
35 2019-09-30 Monday 40 1
[...]
[63 rows x 4 columns]
If you want the cycle to start on a week number that is not divisible by 5, you can do that too, by subtracting the first week's modulo-5 value from all week numbers:
# start week number:
startweekmod = df["Week"][0] % 5
# offset by the initial week's mod
df["idx"] = (df["Week"] - startweekmod) % 5 + 1
I have a pandas dataframe and I need to work out the cumulative sum for each month.
Date Amount
2017/01/12 50
2017/01/12 30
2017/01/15 70
2017/01/23 80
2017/02/01 90
2017/02/01 10
2017/02/02 10
2017/02/03 10
2017/02/03 20
2017/02/04 60
2017/02/04 90
2017/02/04 100
The cumulative sum is the running total within each calendar month (over days 01-31), with rows on the same day combined. However, some days are missing. The data frame should look like
Date Sum_Amount
2017/01/12 80
2017/01/15 150
2017/01/23 230
2017/02/01 100
2017/02/02 110
2017/02/03 140
2017/02/04 390
If you only need the cumulative sum within months, first groupby Date with sum, and then group by the month of the index:
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.month).cumsum().reset_index()
print (df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 140
6 2017-02-04 390
But if you need to distinguish both months and years, convert the index to month periods with to_period:
df = df.groupby(df.index.to_period('m')).cumsum().reset_index()
The difference is easier to see in a changed df, with a different year added:
print (df)
Date Amount
0 2017/01/12 50
1 2017/01/12 30
2 2017/01/15 70
3 2017/01/23 80
4 2017/02/01 90
5 2017/02/01 10
6 2017/02/02 10
7 2017/02/03 10
8 2018/02/03 20
9 2018/02/04 60
10 2018/02/04 90
11 2018/02/04 100
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.month).cumsum().reset_index()
print (df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 120
6 2018-02-03 140
7 2018-02-04 390
df.Date = pd.to_datetime(df.Date)
df = df.groupby('Date').Amount.sum()
df = df.groupby(df.index.to_period('m')).cumsum().reset_index()
print (df)
Date Amount
0 2017-01-12 80
1 2017-01-15 150
2 2017-01-23 230
3 2017-02-01 100
4 2017-02-02 110
5 2017-02-03 120
6 2018-02-03 20
7 2018-02-04 270
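An equivalent sketch with pd.Grouper instead of to_period, if you prefer a single grouping vocabulary (not from the original answer):
# sum per day, then cumulative sum within each (year, month) bucket
daily = df.groupby('Date')['Amount'].sum()
out = daily.groupby(pd.Grouper(freq='M')).cumsum().reset_index()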