I have two different columns in my dataset:
start end
0 2015-01-01 2017-01-01
1 2015-01-02 2015-06-02
2 2015-01-03 2015-12-03
3 2015-01-04 2020-11-25
4 2015-01-05 2025-07-27
I want the difference between start and end in a specific way; here's my desired output.
year_diff month_diff
2 1
0 6
0 12
5 11
10 7
Here the day is not important to me, only month and year. I've tried to_period to get the diff, but it returns the difference in months only. How can I achieve my desired output?
df['end'].dt.to_period('M') - df['start'].dt.to_period('M')
Try:
df["year_diff"]=df["end"].dt.year.sub(df["start"].df.year)
df["month_diff"]=df["end"].dt.month.sub(df["start"].df.month)
This solution assumes that the number of days in a year (365) and in a month (30) is constant. If the datetimes are strings, convert them into datetime objects first. In a Pandas DataFrame this can be done like so:
import pandas as pd

def to_datetime(dataframe):
    # parse both date columns (assumed "%Y-%m-%d" strings) into datetimes
    new_dataframe = pd.DataFrame()
    new_dataframe[0] = pd.to_datetime(dataframe[0], format="%Y-%m-%d")
    new_dataframe[1] = pd.to_datetime(dataframe[1], format="%Y-%m-%d")
    return new_dataframe
Next, column 0 can be subtracted from column 1 to give the difference in days. We can divide this number by 365 using the // operator to get the number of whole years. We can get the number of remaining days using the % operator and divide this by 30 using the // operator to get the number of whole months.
def get_time_diff(dataframe):
    # column 2 holds the elapsed time between the two dates
    dataframe[2] = dataframe[1] - dataframe[0]
    diff_dataframe = pd.DataFrame(columns=["year_diff", "month_diff"])
    for i in dataframe.index:
        year_diff = dataframe[2][i].days // 365
        month_diff = (dataframe[2][i].days % 365) // 30
        diff_dataframe.loc[i] = [year_diff, month_diff]
    return diff_dataframe
An example output from using these functions would be:
start end days_diff year_diff month_diff
0 2019-10-15 2021-08-11 666 days 1 10
1 2020-02-11 2022-10-13 975 days 2 8
2 2018-12-17 2020-09-16 639 days 1 9
3 2017-01-03 2017-01-28 25 days 0 0
4 2019-12-21 2022-03-10 810 days 2 2
5 2018-08-08 2019-05-07 272 days 0 9
6 2017-06-18 2020-08-01 1140 days 3 1
7 2017-11-14 2020-04-17 885 days 2 5
8 2019-08-19 2020-05-10 265 days 0 8
9 2018-05-05 2020-09-08 857 days 2 4
Note: This will give the number of whole years and months. Hence, if there is a remainder of 29 days, one day short of a month, it will not be counted.
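If the approximation is a problem, a calendar-exact sketch using dateutil's relativedelta (dateutil ships as a pandas dependency) gives whole years and months without the 365/30 assumption:

from dateutil.relativedelta import relativedelta

import pandas as pd

start = pd.Timestamp("2015-01-04")
end = pd.Timestamp("2020-11-25")

delta = relativedelta(end, start)  # exact calendar difference
print(delta.years, delta.months)   # -> 5 10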
I have a pandas data frame with dates. I need to know if every other date pair is consecutive.
2 1988-01-01
3 2015-01-31
4 2015-02-01
5 2015-05-31
6 2015-06-01
7 2021-11-16
11 2021-11-17
12 2022-10-05
8 2022-10-06
9 2022-10-12
10 2022-10-13
# How to build this example dataframe
df = pd.DataFrame({'date': pd.to_datetime([
    '1988-01-01', '2015-01-31', '2015-02-01', '2015-05-31', '2015-06-01',
    '2021-11-16', '2021-11-17', '2022-10-05', '2022-10-06', '2022-10-12',
    '2022-10-13'])})
Each pair should be consecutive. I have tried different sorting but everything I see relates to the entire series being consecutive. I need to compare each pair of dates after the first date.
cb_gap = cb_sorted.sort_values('dates').groupby('dates').diff() > pd.to_timedelta('1 day')
What I need to see is this...
2 1988-01-01 <- Ignore the start date
3 2015-01-31 <- these dates have no gap
4 2015-02-01
5 2015-05-31 <- these dates have no gap
6 2015-06-01
7 2021-11-16 <- these have a gap!!!!
11 2021-11-18
12 2022-10-05 <- these have no gap
8 2022-10-06
9 2022-10-12
One way is to use shift and compute differences.
pd.DataFrame({'date':df.date,'diff':df.date.shift(-1)-df.date})[1::2]
returns
date diff
1 2015-01-31 1 days
3 2015-05-31 1 days
5 2021-11-16 1 days
7 2022-10-05 1 days
9 2022-10-12 1 days
It is also faster:

Method      Timeit
Naveed's    4.23 ms
This one    0.93 ms
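To turn this into the consecutiveness check the question asks for, one possible follow-up (the has_gap name is my own) flags pairs whose difference exceeds one day:

import pandas as pd

# pairs of (first date of pair, gap to the second date), as computed above
pairs = pd.DataFrame({'date': df.date, 'diff': df.date.shift(-1) - df.date})[1::2]

# True where the second date of the pair is not the very next day
pairs['has_gap'] = pairs['diff'] > pd.Timedelta(days=1)
print(pairs)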
Here is one way to do it.
BTW, what is your expected output? This answer gets you the difference between the consecutive dates, skipping the first row, and populates a diff column.
# make date into datetime
df['date'] = pd.to_datetime(df['date'])

# create two intermediate DFs (this assumes the frame also has an 'id' column),
# skipping the first row and taking alternate values, and concat them along the x-axis
df2 = pd.concat([df.iloc[1:].iloc[::2].reset_index()[['id', 'date']],
                 df.iloc[2:].iloc[::2].reset_index()[['id', 'date']]],
                axis=1)

# take the difference of the second date from the first one
df2['diff'] = df2.iloc[:, 3] - df2.iloc[:, 1]
df2
id date id date diff
0 3 2015-01-31 4 2015-02-01 1 days
1 5 2015-05-31 6 2015-06-01 1 days
2 7 2021-11-16 11 2021-11-17 1 days
3 12 2022-10-05 8 2022-10-06 1 days
4 9 2022-10-12 10 2022-10-13 1 days
Consider the following DataFrame in Python pandas:
date         time_SEL  time_02_SEL_01  time_03_SEL_05  other
2022-01-01      34756          233232         3432423    756
2022-01-03      23322            4343            3334    343
2022-02-01     123232            3242           23423    434
2022-03-01    7323232           32423          323423  34324
All columns other than date represent an amount of time in seconds. My idea is to convert these values to Timedelta, keeping in mind that I only want to apply the change to columns containing the string "_SEL".
Naturally I want to select the columns by that string, because the original dataset will have more than 3 columns containing it. If there were only 3, I would know how to do it manually.
You can apply pandas.to_timedelta on all columns selected by filter and update the original dataframe:
df.update(df.filter(like='_SEL').apply(pd.to_timedelta, unit='s'))
NB: there is no output; the modification is in place.
updated dataframe:
date time_SEL time_02_SEL time_03_SEL other
0 2022-01-01 0 days 09:39:16 2 days 16:47:12 39 days 17:27:03 756
1 2022-01-03 0 days 06:28:42 0 days 01:12:23 0 days 00:55:34 343
2 2022-02-01 1 days 10:13:52 0 days 00:54:02 0 days 06:30:23 434
3 2022-03-01 84 days 18:13:52 0 days 09:00:23 3 days 17:50:23 34324
update "TypeError: invalid type promotion"
ensure you have numbers:
df.update(df.filter(like='_SEL')
            .apply(lambda c: pd.to_timedelta(pd.to_numeric(c, errors='coerce'),
                                             unit='s'))
          )
Use DataFrame.filter to get all columns ending with _SEL, convert them to timedeltas with to_timedelta, and replace the originals using DataFrame.update:
df.update(df.filter(regex='_SEL$').apply(lambda x: pd.to_timedelta(x, unit='s')))
print (df)
date time_SEL time_02_SEL time_03_SEL other
0 2022-01-01 0 days 09:39:16 2 days 16:47:12 39 days 17:27:03 756
1 2022-01-03 0 days 06:28:42 0 days 01:12:23 0 days 00:55:34 343
2 2022-02-01 1 days 10:13:52 0 days 00:54:02 0 days 06:30:23 434
3 2022-03-01 84 days 18:13:52 0 days 09:00:23 3 days 17:50:23 34324
Another idea is to filter the columns with str.endswith:
m = df.columns.str.endswith('_SEL')
df.loc[:, m] = df.loc[:, m].apply(lambda x: pd.to_timedelta(x, unit='s'))
print (df)
date time_SEL time_02_SEL time_03_SEL other
0 2022-01-01 0 days 09:39:16 2 days 16:47:12 39 days 17:27:03 756
1 2022-01-03 0 days 06:28:42 0 days 01:12:23 0 days 00:55:34 343
2 2022-02-01 1 days 10:13:52 0 days 00:54:02 0 days 06:30:23 434
3 2022-03-01 84 days 18:13:52 0 days 09:00:23 3 days 17:50:23 34324
EDIT: To convert the column values to integers, use .astype(int):
df.update(df.filter(regex='_SEL$').astype(int).apply(lambda x: pd.to_timedelta(x, unit='s')))
If that fails because of some non-numeric values, use:
df.update(df.filter(regex='_SEL$').apply(lambda x: pd.to_timedelta(pd.to_numeric(x, errors='coerce'), unit='s')))
I'm using .weekday() to find the day of the week as an integer (Monday = 0 ... Sunday = 6) for every day from today until next year (+365 days from today). The problem is that if the month starts mid-week, I need to return the day of the week with the 1st day of the month now being 0.
Ex. If the month starts on a Wednesday, then Wednesday = 0 ... Sunday = 4 (for that week only).
[Annotated picture of a month explaining what I want to do]
I originally had the code below, but it is wrong: the condition is always true, so the first branch runs every day.
import datetime
from datetime import date

for day in range(1, 365):
    departure_date = date.today() + datetime.timedelta(days=day)
    if departure_date.weekday() < 7:
        day_of_week = departure_date.day
    else:
        day_of_week = departure_date.weekday()
The following seems to do the job properly:
import datetime as dt

def custom_weekday(date):
    if date.weekday() > (date.day - 1):
        return date.day - 1
    else:
        return date.weekday()

for day in range(1, 366):
    departure_date = dt.date.today() + dt.timedelta(days=day)
    day_of_week = custom_weekday(date=departure_date)
    print(departure_date, day_of_week, departure_date.weekday())
Your code had two small bugs:
the if condition was wrong
days are represented inconsistently: date.weekday() is 0-based, date.day is 1-based
For every date, get the first week of that month. Then, check if the date is within that first week. If it is, use the .day - 1 value (since you are 0-based). Otherwise, use the .weekday().
from datetime import date, timedelta

for day in range(-5, 40):
    departure_date = date.today() + timedelta(days=day)
    first_week = date(departure_date.year, departure_date.month, 1).isocalendar()[1]
    if first_week == departure_date.isocalendar()[1]:
        day_of_week = departure_date.day - 1
    else:
        day_of_week = departure_date.weekday()
    print(departure_date, day_of_week)
2021-08-27 4
2021-08-28 5
2021-08-29 6
2021-08-30 0
2021-08-31 1
2021-09-01 0
2021-09-02 1
2021-09-03 2
2021-09-04 3
2021-09-05 4
2021-09-06 0
2021-09-07 1
2021-09-08 2
2021-09-09 3
2021-09-10 4
2021-09-11 5
2021-09-12 6
2021-09-13 0
2021-09-14 1
2021-09-15 2
2021-09-16 3
2021-09-17 4
2021-09-18 5
2021-09-19 6
2021-09-20 0
2021-09-21 1
2021-09-22 2
2021-09-23 3
2021-09-24 4
2021-09-25 5
2021-09-26 6
2021-09-27 0
2021-09-28 1
2021-09-29 2
2021-09-30 3
2021-10-01 0
2021-10-02 1
2021-10-03 2
2021-10-04 0
2021-10-05 1
2021-10-06 2
2021-10-07 3
2021-10-08 4
2021-10-09 5
2021-10-10 6
For any date D.M.Y, get the weekday W of 1.M.Y.
Then you only need to adjust the weekday value for the first 7-W days of that month. To adjust, simply subtract W.
Example for September 2021: the first date of the month (1.9.2021) is a Wednesday, so W is 2. You need to adjust the weekdays of dates 1.9.2021 to 5.9.2021 (because 7-2 is 5) by subtracting 2.
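A minimal sketch of that rule (the function name is my own):

import datetime as dt

def adjusted_weekday(d: dt.date) -> int:
    # W: weekday of the first day of d's month (Monday = 0)
    w = d.replace(day=1).weekday()
    # only the first 7 - W days of the month are shifted down by W
    if d.day <= 7 - w:
        return d.weekday() - w
    return d.weekday()

# September 2021 starts on a Wednesday (W = 2), so 1.9.-5.9. map to 0..4
print([adjusted_weekday(dt.date(2021, 9, day)) for day in range(1, 8)])
# -> [0, 1, 2, 3, 4, 0, 1]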
I have a dataset with meteorological features for 2019, to which I want to join two columns of power consumption datasets for 2017, 2018. I want to match them by hour, day and month, but the data belongs to different years. How can I do that?
The meteo dataset is a similar 6-column dataframe with a DatetimeIndex belonging to 2019.
You can build 3 additional columns from the index that represent the hour, day and month, and use them for a later join. DatetimeIndex has attributes for the different parts of the timestamp:
import pandas as pd
ind = pd.date_range(start='2020-01-01', end='2020-01-20', periods=10)
df = pd.DataFrame({'number' : range(10)}, index = ind)
df['hour'] = df.index.hour
df['day'] = df.index.day
df['month'] = df.index.month
print(df)
number hour day month
2020-01-01 00:00:00 0 0 1 1
2020-01-03 02:40:00 1 2 3 1
2020-01-05 05:20:00 2 5 5 1
2020-01-07 08:00:00 3 8 7 1
2020-01-09 10:40:00 4 10 9 1
2020-01-11 13:20:00 5 13 11 1
2020-01-13 16:00:00 6 16 13 1
2020-01-15 18:40:00 7 18 15 1
2020-01-17 21:20:00 8 21 17 1
2020-01-20 00:00:00 9 0 20 1
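To actually join on those year-agnostic keys, a merge sketch (the frame and column names here are assumptions, not from the question):

import pandas as pd

# hypothetical stand-ins for the meteo (2019) and power (2017) frames
meteo = pd.DataFrame({'temp': [1.2, 3.4]},
                     index=pd.to_datetime(['2019-01-01 00:00', '2019-01-01 01:00']))
power = pd.DataFrame({'consumption': [10.5, 11.2]},
                     index=pd.to_datetime(['2017-01-01 00:00', '2017-01-01 01:00']))

for frame in (meteo, power):
    frame['hour'] = frame.index.hour
    frame['day'] = frame.index.day
    frame['month'] = frame.index.month

# join rows that share the same hour, day and month across years
merged = meteo.merge(power, on=['hour', 'day', 'month'], how='left')
print(merged)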
An example of what the df looks like:
customer order_datetime
a 01-03-2017 12:00:00 PM
b 01-04-2017 12:00:00 PM
c 01-07-2017 12:00:00 PM
a 01-08-2017 12:00:00 PM
b 01-09-2017 12:00:00 PM
a 01-11-2017 12:00:00 PM
There are 2 things I want to achieve, but I'm still in the learning process; I'd really appreciate any help to guide me in the right direction.
Create a list of "time between orders" where I can find the min, mean, max
Find out if "time between order" gets faster/slower, i.e. time between order_3 and order_2 vs time between order_2 and order_1
This example should set you in the right direction for your assignment.
First I'm creating a DataFrame similar to the one you show in the question:
import pandas as pd
import numpy as np
import datetime as dt
orders = pd.DataFrame({
    'client': np.random.randint(65, 70, size=15),
    'date': np.random.randint(0, 30, size=15)})
orders.client = orders.client.apply(chr)
# interpret the random integers as day offsets from 2017-01-01
orders.date = pd.to_datetime(orders.date, unit='D', origin=dt.date(2017, 1, 1))
# Sorting here is not necessary, just for visualization
orders.sort_values(['client', 'date'], inplace=True)
orders.reset_index(inplace=True, drop=True)
orders.head()
>>>>
client date
0 A 2017-01-27
1 A 2017-01-29
2 A 2017-01-30
3 B 2017-01-03
4 B 2017-01-13
The key to the solution is in the line orders.groupby('client').date.apply(pd.Series.sort_values).diff().
First we use groupby to group the orders using client as a key, then we select the date column only and sort the dates in each group with pd.Series.sort_values, and finally we use diff to compute the difference of each record from the previous one (this is why the dates in each group must be sorted).
The rest of the code is just to visualize the result, i.e. renaming the Series you obtain and concatenating it with the initial DataFrame.
diff_df = pd.concat([
    orders,
    orders.groupby('client').date.diff().rename('diff')], axis=1)
diff_df.head(10)
>>>>
client date diff
0 A 2017-01-27 NaT
1 A 2017-01-29 2 days
2 A 2017-01-30 1 days
3 B 2017-01-03 NaT
4 B 2017-01-13 10 days
5 B 2017-01-18 5 days
6 B 2017-01-24 6 days
7 C 2017-01-01 NaT
8 C 2017-01-02 1 days
9 C 2017-01-03 1 days
Once you have the time differences you can compute all kinds of in-group metrics you need.
First you can try pd.Series.describe:
diff_df.groupby('client')['diff'].describe()
>>>>
count mean std min \
client
A 1 5 days 00:00:00 NaT 5 days 00:00:00
B 1 12 days 00:00:00 NaT 12 days 00:00:00
C 3 4 days 00:00:00 1 days 17:34:09.189773 2 days 00:00:00
D 1 4 days 00:00:00 NaT 4 days 00:00:00
E 4 5 days 00:00:00 3 days 03:53:40.789838 2 days 00:00:00
25% 50% 75% max
client
A 5 days 00:00:00 5 days 00:00:00 5 days 00:00:00 5 days 00:00:00
B 12 days 00:00:00 12 days 00:00:00 12 days 00:00:00 12 days 00:00:00
C 3 days 12:00:00 5 days 00:00:00 5 days 00:00:00 5 days 00:00:00
D 4 days 00:00:00 4 days 00:00:00 4 days 00:00:00 4 days 00:00:00
E 2 days 18:00:00 4 days 12:00:00 6 days 18:00:00 9 days 00:00:00
If that is not enough you can define your own aggregations.
You will need a list of functions if you work on a single Series:
metrics = [pd.Series.count, pd.Series.nunique, pd.Series.min, pd.Series.max, pd.Series.mean]
diff_df.groupby('client')['diff'].aggregate(metrics)
>>>>
count nunique min max mean
client
A 1 1 5 days 5 days 5 days
B 1 1 12 days 12 days 12 days
C 3 2 2 days 5 days 4 days
D 1 1 4 days 4 days 4 days
E 4 4 2 days 9 days 5 days
Or a dictionary of {column -> function, column -> function_list} if you work on the whole DataFrame:
metrics = {
    'date': [pd.Series.count, pd.Series.nunique],
    'diff': [pd.Series.min, pd.Series.max, pd.Series.mean],
}
diff_df.groupby('client').aggregate(metrics)
>>>>
diff date
min max mean count nunique
client
A 5 days 5 days 5 days 2 2
B 12 days 12 days 12 days 2 2
C 2 days 5 days 4 days 4 4
D 4 days 4 days 4 days 2 2
E 2 days 9 days 5 days 5 5
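For the second point in the question (whether the time between orders gets faster or slower), one possible sketch is to diff the gaps themselves within each client; a negative value means the latest gap is shorter than the previous one, i.e. the client is ordering faster:

# change between consecutive gaps within each client
diff_df['gap_change'] = diff_df.groupby('client')['diff'].diff()
print(diff_df.head(10))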