I want to compare a timestamp of dtype datetime64[ns] with a datetime.date, and I only want the comparison to be based on day and month.
df
timestamp last_price
0 2023-01-22 14:15:06.033314 100.0
1 2023-01-25 14:15:06.213591 101.0
2 2023-01-30 14:15:06.313554 102.0
3 2023-03-31 14:15:07.018540 103.0
cu_date = datetime.datetime.now().date()
cu_year = cu_date.year
check_end_date = datetime.datetime.strptime(f'{cu_year}-11-05', '%Y-%m-%d').date()
check_start_date = datetime.datetime.strptime(f'{cu_year}-03-12', '%Y-%m-%d').date()
# this is incorrect as the day can be greater than check_start_date while the month might be less.
daylight_off_df = df.loc[((df.timestamp.dt.month >= check_end_date.month) & (df.timestamp.dt.day >= check_end_date.day)) |
((df.timestamp.dt.month <= check_start_date.month) & (df.timestamp.dt.day <= check_start_date.day))]
daylight_on_df = df.loc[((df.timestamp.dt.month <= check_end_date.month) & (df.timestamp.dt.day <= check_end_date.day)) &
((df.timestamp.dt.month >= check_start_date.month) & (df.timestamp.dt.day >= check_start_date.day))]
I am trying to think up the logic to do this, but failing.
Expected output:
daylight_off_df
timestamp last_price
0 2023-01-22 14:15:06.033314 100.0
1 2023-01-25 14:15:06.213591 101.0
2 2023-01-30 14:15:06.313554 102.0
daylight_on_df
timestamp last_price
3 2023-03-31 14:15:07.018540 103.0
In summary: separate the dataframe by a day-and-month comparison while ignoring the year.
I would break out these values and then just query:
df['day'] = df['timestamp'].dt.day_name()
df['month'] = df['timestamp'].dt.month_name()
then whatever you're looking for:
df.groupby('month').mean()
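For the original split itself, here is a hedged sketch using numeric dt.month/dt.day and a lexicographic (month, day) comparison; the 12 March and 5 November cutoffs are taken from the question:
# A timestamp is on or after 12 Mar if its month is past March, or it is
# March with day >= 12; "on or before 5 Nov" works symmetrically.
m, d = df['timestamp'].dt.month, df['timestamp'].dt.day
after_start = (m > 3) | ((m == 3) & (d >= 12))
before_end = (m < 11) | ((m == 11) & (d <= 5))
daylight_on_df = df.loc[after_start & before_end]
daylight_off_df = df.loc[~(after_start & before_end)]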
The following properties could be helpful if you don't want an additional column in your table:
check_end_date.timetuple().tm_yday # returns day of the year
#output 309
check_start_date.timetuple().tm_yday
#output 71
df['timestamp'].dt.is_leap_year.astype(int) #returns 1 if year is a leapyear
#output 0 | 1
df['timestamp'].dt.dayofyear #returns day of the year
#output
#0 22
#1 25
#2 30
#3 90
df['timestamp'].dt.dayofyear.between(a,b) #returns true if day is between a,b
There are some possible solutions now; I think using between is the nicest-looking one.
daylight_on_df4 = df.loc[df['timestamp'].dt.dayofyear.between(
check_start_date.timetuple().tm_yday + df['timestamp'].dt.is_leap_year.astype(int),
check_end_date.timetuple().tm_yday + df['timestamp'].dt.is_leap_year.astype(int))]
daylight_off_df4 = df.loc[~df['timestamp'].dt.dayofyear.between(
check_start_date.timetuple().tm_yday + df['timestamp'].dt.is_leap_year.astype(int),
check_end_date.timetuple().tm_yday + df['timestamp'].dt.is_leap_year.astype(int))]
Or the code could look like this:
daylight_on_df3 = df.loc[((check_end_date.timetuple().tm_yday + df['timestamp'].dt.is_leap_year.astype(int)) - df['timestamp'].dt.dayofyear > 0)
& (df['timestamp'].dt.dayofyear - (df['timestamp'].dt.is_leap_year.astype(int) + check_start_date.timetuple().tm_yday) > 0)]
daylight_off_df3 = df.loc[((check_end_date.timetuple().tm_yday + df['timestamp'].dt.is_leap_year.astype(int)) - df['timestamp'].dt.dayofyear < 0)
| (df['timestamp'].dt.dayofyear - (check_start_date.timetuple().tm_yday + df['timestamp'].dt.is_leap_year.astype(int)) < 0)]
All daylight_on/off are doing now is checking whether the day of the year falls within your range or not (with a leap-year adjustment).
This formula would probably have to be rewritten if your start/end dates crossed a year boundary (e.g. 2022-11-19 to 2023-02-22), but I think it conveys the general idea.
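For that wrapped case, a minimal sketch (with hypothetical cutoffs whose start day-of-year exceeds the end's) would invert the check, keeping dates that fall after the start or before the end:
doy = df['timestamp'].dt.dayofyear
start_doy = check_start_date.timetuple().tm_yday  # hypothetical, e.g. 19 Nov -> 323
end_doy = check_end_date.timetuple().tm_yday      # hypothetical, e.g. 22 Feb -> 53
in_range = (doy >= start_doy) | (doy <= end_doy)  # the two halves of the wrapped interval
wrapped_on_df = df.loc[in_range]
wrapped_off_df = df.loc[~in_range]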
Related
I have a dataframe with some NaN and outlier values, which I want to fill with the mean value of the specific month.
df[["arrival_day", "ib_units", "month","year"]]
arrival_day ib_units month year
37 2020-01-01 262 1 2020
235 2020-01-02 2301 1 2020
290 2020-01-02 145 1 2020
476 2020-01-02 6584 1 2020
551 2020-01-02 30458 1 2020
... ... ... ... ...
1479464 2022-07-19 56424 7 2022
1479490 2022-07-19 130090 7 2022
1479510 2022-07-19 3552 7 2022
1479556 2022-07-19 23779 7 2022
1479756 2022-07-20 2882 7 2022
I know there is the pandas.DataFrame.fillna function, df.fillna(df.mean()), but in this case it would use the overall mean of the whole dataset. I want to fill the NaNs with the mean value of the specific month in the specific year.
This is what I have tried but this solution is not straightforward and only calculates the mean by year and not the mean by month:
mask_2020 = (df['arrival_day'] >= '2020-01-01') & (df['arrival_day'] <= '2020-12-31')
df_2020 = df.loc[mask_2020]
mask_2021 = (df['arrival_day'] >= '2021-01-01') & (df['arrival_day'] <= '2021-12-31')
df_2021 = df.loc[mask_2021]
mask_2022 = (df['arrival_day'] >= '2022-01-01') & (df['arrival_day'] <= '2022-12-31')
df_2022 = df.loc[mask_2022]
mean_2020 = df_2020.ib_units.mean()
mean_2021 = df_2021.ib_units.mean()
mean_2022 = df_2022.ib_units.mean()
# this finds quartile outliers and replaces them with the mean value of the specific year
for x in ['ib_units']:
    q75, q25 = np.percentile(df_2020.loc[:, x], [75, 25])
    intr_qr = q75 - q25
    max = q75 + (1.5 * intr_qr)
    min = q25 - (1.5 * intr_qr)
    df_2020.loc[df_2020[x] < min, x] = mean_2020
    df_2020.loc[df_2020[x] > max, x] = mean_2020
for x in ['ib_units']:
    q75, q25 = np.percentile(df_2021.loc[:, x], [75, 25])
    intr_qr = q75 - q25
    max = q75 + (1.5 * intr_qr)
    min = q25 - (1.5 * intr_qr)
    df_2021.loc[df_2021[x] < min, x] = mean_2021
    df_2021.loc[df_2021[x] > max, x] = mean_2021
for x in ['ib_units']:
    q75, q25 = np.percentile(df_2022.loc[:, x], [75, 25])
    intr_qr = q75 - q25
    max = q75 + (1.5 * intr_qr)
    min = q25 - (1.5 * intr_qr)
    df_2022.loc[df_2022[x] < min, x] = mean_2022
    df_2022.loc[df_2022[x] > max, x] = mean_2022
So how can I do this in a more code-efficient way, and by month rather than by year?
Thanks!
I think you are overthinking this. Please see whether the code below works for you. For outliers, the code should be the same as for filling the NaNs.
import pandas as pd
import datetime as dt
# Sample Data:
df = pd.DataFrame({'date': ['2000-01-02', '2000-01-02', '2000-01-15', '2000-01-27',
'2000-06-03', '2000-06-29', '2000-06-15', '2000-06-29',
'2001-01-02', '2001-01-02', '2001-01-15', '2001-01-27'],
'val':[5,7,None,4,
8,1,None,9,
2,3,None,7]})
# Convert to datetime and extract year/month:
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
# Create mean value:
tem = df.groupby(['year', 'month'])[['val']].mean().reset_index()
tem.rename(columns={'val': 'val_mean'}, inplace=True)
tem
# Merge and fill NA:
df = pd.merge(df, tem, how='left', on=['year', 'month'])
df.loc[df['val'].isna(),'val'] = df['val_mean']
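For the outlier part mentioned above, a sketch on the merged frame (reusing the asker's 1.5 * IQR fence) could replace out-of-fence values with the same monthly mean:
# Values outside the 1.5*IQR fence get their (year, month) mean instead:
q25, q75 = df['val'].quantile([0.25, 0.75])
iqr = q75 - q25
outside = (df['val'] < q25 - 1.5 * iqr) | (df['val'] > q75 + 1.5 * iqr)
df.loc[outside, 'val'] = df.loc[outside, 'val_mean']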
I want to slightly modify #PTQouc's code and rely on his dataframe.
Grouping
tem = df.groupby(['year','month'])['val'].mean().reset_index()
Merging
merged = df.merge(tem, how='left', on=['year', 'month'])
Using Where
merged['col_z'] = merged['val_x'].where(merged['val_x'].notnull(), merged['val_y'])
Dropping
merged = merged.drop(['val_x','val_y'],axis=1)
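A hedged alternative that avoids the merge and the column clean-up entirely is groupby with transform, which broadcasts each (year, month) mean back to the original rows:
# Fill each NaN with the mean of its own (year, month) group:
df['val'] = df['val'].fillna(df.groupby(['year', 'month'])['val'].transform('mean'))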
I would like to calculate how many customers there were in each month over the past years. My dataframe contains a customer ID, a start date (when the customer became a customer) and an end date (when the customer stopped being a customer):
Customer_ID StartDate EndDate
1 01/01/2019 NAT
2 25/10/2017 01/06/2020
2 13/06/2012 15/07/2015
2 20/12/2015 03/01/2016
2 25/03/2016 14/06/2017
3 05/06/2018 05/06/2019
3 12/12/2019 NAT
The result I would like is a count of the number of customers that were "active" in each month-year combination:
MONTH YEAR NUMB_CUSTOMERS
01 2013 1
02 2013 1
03 2013 1
04 2013 1
...
01 2019 2
...
09 2020 2
I would like to avoid for-loops, as that takes too long (I have a table of over 100 000 rows).
Has anyone an idea to do this neat and quickly?
Thanks!
First, read the data and make it digestible for the program:
import pandas as pd
import datetime
df = pd.read_csv("table.csv")
func = lambda x: x.split('/', maxsplit=1)[1]
df["StartDate"] = df["StartDate"].apply(func)
mask = df["EndDate"] != "NAT"
df.loc[mask, "EndDate"] = df.loc[mask, "EndDate"].apply(func)
Then, count the changes in the number of clients (you basically get a derivative of your data):
customers_gained = df[["Customer_ID", "StartDate"]].groupby("StartDate").agg("count")
customers_lost = df[["Customer_ID", "EndDate"]].groupby("EndDate").agg("count")
customers_lost.drop("NAT",inplace=True)
Make a time table covering all changes in the number of clients:
def make_time_table(start, end):
    start_date = datetime.datetime.strptime(start, "%d/%m/%Y")
    end_date = datetime.datetime.strptime(end, "%d/%m/%Y")
    data_range = pd.date_range(start_date, end_date, freq="M")
    string_range = [el.strftime("%m/%Y") for el in data_range]
    ser = pd.Series([0] * data_range.size, index=string_range)
    return ser
Next, introduce the changes into time_table and "integrate" by accumulation:
time_table = make_time_table("01/01/2012", "01/12/2020")
time_table[customers_gained.index] = customers_gained["Customer_ID"]
time_table[customers_lost.index] -= customers_lost["Customer_ID"]
result = time_table.cumsum()
print(result)
Outputs:
01/2012 0
02/2012 0
03/2012 0
04/2012 0
05/2012 0
06/2012 1
07/2012 1
...
10/2019 2
11/2019 2
12/2019 3
01/2020 3
02/2020 3
03/2020 3
04/2020 3
05/2020 3
06/2020 2
07/2020 2
08/2020 2
09/2020 2
10/2020 2
11/2020 2
dtype: int64
table.csv
Customer_ID,StartDate,EndDate
1,01/01/2019,NAT
2,25/10/2017,01/06/2020
2,13/06/2012,15/07/2015
2,20/12/2015,03/01/2016
2,25/03/2016,14/06/2017
3,25/03/2016,05/06/2019
3,12/12/2019,NAT
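A more vectorized sketch of the same derivative-and-cumsum idea, assuming the same table.csv and using pandas Periods instead of "MM/YYYY" strings:
import pandas as pd

df = pd.read_csv("table.csv")
df['StartDate'] = pd.to_datetime(df['StartDate'], dayfirst=True)
df['EndDate'] = pd.to_datetime(df['EndDate'], dayfirst=True, errors='coerce')  # "NAT" -> NaT
gained = df['StartDate'].dt.to_period('M').value_counts()  # customers added per month
lost = df['EndDate'].dt.to_period('M').value_counts()      # customers lost per month
months = pd.period_range('2012-01', '2020-12', freq='M')
result = (gained.reindex(months, fill_value=0)
          - lost.reindex(months, fill_value=0)).cumsum()   # running count of active customers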
I have a dataframe
id |start|stop|join_date
233| 0 | 12 |2015-01-01
234| 0 | 12 |2013-03-04
235| 10 | 23 |2014-01-10
GOAL:
I want to create another column stop_date that offsets the join_date based on whether or not the start date is 0.
If start is 0, then stop_date is join_date offset by the number of months in stop.
If start is not 0, then stop_date is join_date offset by the months in stop plus the months in start.
I wrote the following function:
def stop_date(x):
    if x['start'] == 0:
        return x['join_date'] + x['stop'].astype('timedelta64[M]')
    elif x['start'] != 0:
        return x['join_date'] + x['start'].astype('timedelta64[M]') + x['stop'].astype('timedelta64[M]')
    else:
        return x
I tried to apply to the dataframe by:
df['stop_date'] = df.apply(stop_date, axis = 1)
I keep getting an error : AttributeError: ("'int' object has no attribute 'astype'", 'occurred at index 0')
I cannot figure out how to achieve this.
Because when start is 0 the sum of start and stop doesn't change the number of months to add, you can sum both, convert with astype and add the result to join_date:
df['stop_date'] = (pd.to_datetime(df['join_date'])
+ df[['start', 'stop']].sum(axis=1).astype('timedelta64[M]')
).dt.date
print (df)
id start stop join_date stop_date
0 233 0 12 2015-01-01 2016-01-01
1 234 0 12 2013-03-04 2014-03-04
2 235 10 23 2014-01-10 2016-10-10
Convert the columns to the desired dtype before you apply the function. x['stop'] is a scalar value of the datatype of the column (e.g., 12), so it has no dataframe or series methods, such as astype.
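On newer pandas (2.0 and later), astype('timedelta64[M]') on an integer column raises, since a month is not a fixed-length unit. A hedged sketch using DateOffset shifts by calendar months and keeps the day of month (element-wise, so slower on large frames):
import pandas as pd

df['join_date'] = pd.to_datetime(df['join_date'])
months = df['start'] + df['stop']
# DateOffset(months=n) moves by whole calendar months, e.g. 10 Jan + 33 months -> 10 Oct:
df['stop_date'] = df['join_date'] + months.apply(lambda n: pd.DateOffset(months=int(n)))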
I have a dataframe that looks like this:
stuff datetime value
A 1/1/2019 3
A 1/2/2019 4
A 1/3/2019 5
A 1/4/2019 6
...
I want to create a new dataframe that looks like this:
stuff avg_3 avg_4 avg_5
A 3.4 4.5 5.5
B 2.3 4.2 6.1
where avg_3 is the average over the last 3 days from today, avg_4 the average over the last 4 days, etc., grouped by stuff.
How do I do that?
My current code:
df.groupby('stuff').apply(lambda x: pd.Series(dict(
    day_3=(x.datetime > datetime.now() - timedelta(days=3)).mean(),
    day_7=(x.datetime > datetime.now() - timedelta(days=7)).mean())))
Thanks in advance
Create boolean masks before groupby, add new columns by assign and groupby with mean:
# pd.datetime was removed in recent pandas; use pd.Timestamp instead
m1 = df.datetime > pd.Timestamp.now() - pd.Timedelta(days=3)
m2 = df.datetime > pd.Timestamp.now() - pd.Timedelta(days=7)
df = df.assign(day_3=m1, day_7=m2).groupby('stuff')[['day_3', 'day_7']].mean()
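Note that the boolean mean yields the share of rows inside each window; if the goal is the average of value over the window (as the expected output suggests), a hedged variant masks the values first:
# .where() keeps `value` only inside the window (NaN elsewhere), and
# groupby().mean() skips the NaNs, giving the windowed average:
avg = (df.assign(avg_3=df['value'].where(m1), avg_7=df['value'].where(m2))
         .groupby('stuff')[['avg_3', 'avg_7']].mean())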
I'm trying to make a program that will equally distribute employees' days off. There are 4 groups, and each group has its own weekmask for each week of the month. So far I've made code that changes the weekmask when it locates a 0 (Sunday) in the DataFrame. I'm stuck on structuring the np.busday_count(start, end, weekmask=...) call so that the start and end dates change automatically.
My dataframe (shown as an image in the original post) has a Workday column in which Sundays appear as 0.
And here's my code:
a: int = 0
week_mask: str = '1100111'

def _change_week_mask():
    global a, week_mask
    a += 1
    if a == 1:
        week_mask = '1111000'
    elif a == 2:
        week_mask = '1111111'
    elif a == 3:
        week_mask = '0011111'
    else:
        a = 0

for line in rows['Workday']:
    if line == '0':  # `is` compared identity, not equality
        _change_week_mask()
Edit: changed the value of start week from 6 to 0.
OK, so to answer your problem I have created a sample data frame with the code below.
Then I have added the following columns to the data frame (the code follows after this list).
dayofweek - to get data similar to what you created by setting every Sunday to zero. In this case Monday is set as zero and Sunday is six.
weeknum - week of the year.
week - instead of counting and then changing the week mask, I have assigned each week a value from 0 to 3, and based on it we can calculate the mask.
weekmask - using the value of week, I have calculated the mask; you might need to align this with your logic.
weekenddate - the end date, calculated by adding 7 days to the start date; if the month changes mid-week, this will hold the month-end date instead.
After this we can create a new data frame that keeps only the end-of-week entries; in this case Monday is 0, so I have filtered on 0.
Then you can apply the function and store the result in the data frame.
import datetime
import pandas as pd
import numpy as np
df_ = pd.DataFrame({'startdate':pd.date_range(pd.to_datetime('2018-10-01'), pd.to_datetime('2018-11-30'))})
df_['dayofweek'] = df_.startdate.dt.dayofweek
df_['remaining_days_in_month'] = df_.startdate.dt.days_in_month - df_.startdate.dt.day
df_['week'] = df_.startdate.dt.isocalendar().week % 4  # dt.week was removed in pandas 2.0
df_['day'] = df_.startdate.dt.day
df_['weekmask'] = df_.week.map({0 : '1100111', 1 : '1111000' , 2 : '1111111', 3: '0011111'})
df_['weekenddate'] = [x[0] + datetime.timedelta(days=(7-x[1])) if x[2] > 7-x[1] else x[0] + datetime.timedelta(days=(x[2])) for x in df_[['startdate','dayofweek','remaining_days_in_month']].values]
final_df = df_[(df_['dayofweek']==0) | ( df_['day']==1)][['startdate','weekenddate','weekmask']]
final_df['numberofdays'] = [ np.busday_count((x[0]).astype('<M8[D]'), x[1].astype('<M8[D]'), weekmask=x[2]) for x in final_df.values.astype(str)]
Output:
startdate weekenddate weekmask numberofdays
0 2018-10-01 2018-10-08 1100111 5
7 2018-10-08 2018-10-15 1111000 4
14 2018-10-15 2018-10-22 1111111 7
21 2018-10-22 2018-10-29 0011111 5
28 2018-10-29 2018-10-31 1100111 2
31 2018-11-01 2018-11-05 1100111 3
35 2018-11-05 2018-11-12 1111000 4
42 2018-11-12 2018-11-19 1111111 7
49 2018-11-19 2018-11-26 0011111 5
56 2018-11-26 2018-11-30 1100111 2
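One detail worth knowing here: np.busday_count treats the interval as half-open, counting begin dates but not the end date, which is why the first row yields 5:
import numpy as np
# Mon 2018-10-01 .. Sun 2018-10-07 (2018-10-08 is excluded); mask '1100111'
# marks Mon, Tue, Fri, Sat and Sun as valid -> Oct 1, 2, 5, 6, 7 = 5 days.
np.busday_count('2018-10-01', '2018-10-08', weekmask='1100111')  # -> 5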
Let me know if this needs some changes as per your requirements.