Subtract a range of dates from each date in a df column (pandas / python)

I have a df:
ID date
1 05-01
2 04-08
3 06-08
4 03-07
...
and a date range from 01-01-2013 until 12-31-2013: pd.date_range(start='1/1/2013', end='31/12/2013')
For each date in the df, I want to get the difference in days between that date and each date in the date range. For example, to get a df like:
05-01 - 01-01    04-08 - 01-01
05-01 - 01-02    04-08 - 01-02
05-01 - 01-03    04-08 - 01-03
..               ..
05-01 - 12-31    04-08 - 12-31
and so on for each date.
Thanks

We first need to convert the date column to datetime. Then we can perform the subtraction.
Data:
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 2, 3, 4],
    'date': ['05-01', '04-08', '06-08', '03-07']
})
If date is in %d-%m format, we can convert with to_datetime by appending the year and setting dayfirst=True. Then we can broadcast the subtraction and create a new dataframe from the results:
# append year and convert to datetime64[ns]
df['date'] = pd.to_datetime(df['date'] + '-2013', dayfirst=True)
# Build the date range
dr = pd.date_range(start='1/1/2013', end='31/12/2013')
# Create new DataFrame from the broadcasted subtraction.
new_df = pd.DataFrame(
    df['date'].values - dr.values[:, None],
    columns=df['date'].rename(None),
    index=dr
)
df:
ID date
0 1 2013-01-05
1 2 2013-08-04
2 3 2013-08-06
3 4 2013-07-03
new_df (day first):
2013-01-05 2013-08-04 2013-08-06 2013-07-03
2013-01-01 4 days 215 days 217 days 183 days
2013-01-02 3 days 214 days 216 days 182 days
2013-01-03 2 days 213 days 215 days 181 days
2013-01-04 1 days 212 days 214 days 180 days
2013-01-05 0 days 211 days 213 days 179 days
... ... ... ... ...
2013-12-27 -356 days -145 days -143 days -177 days
2013-12-28 -357 days -146 days -144 days -178 days
2013-12-29 -358 days -147 days -145 days -179 days
2013-12-30 -359 days -148 days -146 days -180 days
2013-12-31 -360 days -149 days -147 days -181 days
[365 rows x 4 columns]
If date is in %m-%d format, we can prepend the year instead. The subtraction is exactly the same:
# Prepend year and convert to datetime64[ns]
df['date'] = pd.to_datetime('2013-' + df['date'])
# Build the date range
dr = pd.date_range(start='1/1/2013', end='31/12/2013')
# Create new DataFrame from the broadcasted subtraction.
new_df = pd.DataFrame(
    df['date'].values - dr.values[:, None],
    columns=df['date'].rename(None),
    index=dr
)
df:
ID date
0 1 2013-05-01
1 2 2013-04-08
2 3 2013-06-08
3 4 2013-03-07
new_df (month first):
2013-05-01 2013-04-08 2013-06-08 2013-03-07
2013-01-01 120 days 97 days 158 days 65 days
2013-01-02 119 days 96 days 157 days 64 days
2013-01-03 118 days 95 days 156 days 63 days
2013-01-04 117 days 94 days 155 days 62 days
2013-01-05 116 days 93 days 154 days 61 days
... ... ... ... ...
2013-12-27 -240 days -263 days -202 days -295 days
2013-12-28 -241 days -264 days -203 days -296 days
2013-12-29 -242 days -265 days -204 days -297 days
2013-12-30 -243 days -266 days -205 days -298 days
2013-12-31 -244 days -267 days -206 days -299 days
[365 rows x 4 columns]
If just the day number is wanted, we can divide by 1 day (floor division is safe here: these are pure dates with no time component, so the deltas are always whole numbers of days):
new_df = pd.DataFrame(
    (df['date'].values - dr.values[:, None]) // pd.Timedelta('1D'),
    columns=df['date'].rename(None),
    index=dr
)
new_df:
2013-01-05 2013-08-04 2013-08-06 2013-07-03
2013-01-01 4 215 217 183
2013-01-02 3 214 216 182
2013-01-03 2 213 215 181
2013-01-04 1 212 214 180
2013-01-05 0 211 213 179
... ... ... ... ...
2013-12-27 -356 -145 -143 -177
2013-12-28 -357 -146 -144 -178
2013-12-29 -358 -147 -145 -179
2013-12-30 -359 -148 -146 -180
2013-12-31 -360 -149 -147 -181

Another way:
df["date"] = pd.to_datetime(df["date"]+"-2013", format="%m-%d-%Y")
dates = pd.date_range(start="1/1/2013", end="31/12/2013")
data = df["date"].apply(lambda x: [(x-dt).days for dt in dates]).tolist()
new_df = pd.DataFrame(data=data, index=df["date"], columns=dates).transpose()
>>> new_df
date 2013-05-01 2013-04-08 2013-06-08 2013-03-07
2013-01-01 120 97 158 65
2013-01-02 119 96 157 64
2013-01-03 118 95 156 63
2013-01-04 117 94 155 62
2013-01-05 116 93 154 61
... ... ... ... ...
2013-12-27 -240 -263 -202 -295
2013-12-28 -241 -264 -203 -296
2013-12-29 -242 -265 -204 -297
2013-12-30 -243 -266 -205 -298
2013-12-31 -244 -267 -206 -299
Each value in the DataFrame shows the difference in days between the column header and the row index.
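As a sanity check on this answer, a single value can be looked up by row and column label. The code below is a self-contained rerun of the steps above; per the calendar, 2013-05-01 minus 2013-01-01 is 120 days:

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2, 3, 4],
                   "date": ["05-01", "04-08", "06-08", "03-07"]})
df["date"] = pd.to_datetime(df["date"] + "-2013", format="%m-%d-%Y")
dates = pd.date_range(start="1/1/2013", end="31/12/2013")

data = df["date"].apply(lambda x: [(x - dt).days for dt in dates]).tolist()
new_df = pd.DataFrame(data=data, index=df["date"], columns=dates).transpose()

# difference in days between the column header and the row index
print(new_df.loc[pd.Timestamp("2013-01-01"), pd.Timestamp("2013-05-01")])  # 120
```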

Related

Excel [h]:mm duration to pandas timedelta

I am importing data from an Excel worksheet where I have a 'Duration' field displayed in [h]:mm (so that the total number of hours is shown). I understand that underneath, this is simply number of days as a float.
I want to work with this as a timedelta column or similar in a Pandas dataframe but no matter what I do it's dropping any hours over 24 (e.g. the days portion).
Excel data (values over 24 hours highlighted; screenshot not included):
Pandas import (1d 7h 51m):
BATCH_NO Duration
354 7154 04:36:00
465 7270 06:35:00
466 7271 08:05:00
467 7272 05:54:00
468 7273 09:10:00
472 7277 06:15:00
476 7280 10:23:00
477 7284 06:09:00
499 7313 06:46:00
503 7322 05:27:00
510 7333 14:15:00
515 7335 1900-01-01 07:51:00
516 7338 07:51:00
517 7339 09:00:00
518 7339 05:29:00
519 7339 09:00:00
520 7339 05:29:00
522 7342 12:10:00
525 7343 08:00:00
530 7346 08:25:00
Running a to_datetime conversion simply drops the day (integer) part of the column:
BATCH_NO Duration
354 7154 04:36:00
465 7270 06:35:00
466 7271 08:05:00
467 7272 05:54:00
468 7273 09:10:00
472 7277 06:15:00
476 7280 10:23:00
477 7284 06:09:00
499 7313 06:46:00
503 7322 05:27:00
510 7333 14:15:00
515 7335 07:51:00
516 7338 07:51:00
517 7339 09:00:00
518 7339 05:29:00
519 7339 09:00:00
520 7339 05:29:00
522 7342 12:10:00
525 7343 08:00:00
530 7346 08:25:00
I have tried fixing the dtype on import, but only str or object work (dtype={'Duration': str}); float gives the error float() argument must be a string or a number, not 'datetime.time', and even with str or object Python still treats the column as datetime.time.
Ideally I do not want to change the Excel source data or export to .csv as an intermediate step.
If I got it correctly, the imported objects are datetime and time with the datetime in Julian calendar.
So you must convert with a custom function:
from datetime import datetime, time, timedelta

def convert(t):
    # plain times are durations under 24 h: lift them to a datetime first
    if isinstance(t, time):
        t = datetime.combine(datetime.min, t)
    delta = t - datetime.min
    # datetimes carry Excel's day offset; subtract the epoch to keep the days
    if delta.days != 0:
        delta -= timedelta(days=693594)
    return delta

df['Duration'].apply(convert)
Output:
0 0 days 04:36:00
1 0 days 06:35:00
2 0 days 08:05:00
3 0 days 05:54:00
4 0 days 09:10:00
5 0 days 06:15:00
6 0 days 10:23:00
7 0 days 06:09:00
8 0 days 06:46:00
9 0 days 05:27:00
10 0 days 14:15:00
11 1 days 07:51:00 # corrected
12 0 days 07:51:00
13 0 days 09:00:00
14 0 days 05:29:00
15 0 days 09:00:00
...
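The magic constant 693594 in the answer above is, as far as I can tell, the proleptic ordinal of 1899-12-30: Excel's effective day zero once the phantom 1900-02-29 leap-bug day is accounted for. A quick sanity check:

```python
from datetime import datetime, timedelta

# Excel's effective epoch, accounting for the fictitious 1900-02-29
print(datetime(1899, 12, 30).toordinal())  # 693594

# so a cell imported as datetime(1900, 1, 1, 7, 51) is really 1 day 07:51
delta = datetime(1900, 1, 1, 7, 51) - datetime.min
delta -= timedelta(days=693594)
print(delta)  # 1 day, 7:51:00
```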

Pandas: Total number of days in an year

I would like to get the total number of days in each year, based on the year column, into a Totaldays column.
df_date=pd.date_range(start ='12-31-2021', end ='12-31-2026', freq ='M')
df_date1=pd.DataFrame(df_date)
df_date1['daysinmonth'] = df_date1[0].dt.daysinmonth
df_date1['year'] = df_date1[0].dt.year
df_date1['Totaldays']=?
df_date1
0 daysinmonth Year Totaldays
0 2021-12-31 31 2021 365
1 2022-01-31 31 2022 365
2 2022-02-28 28 2022 365
3 2022-03-31 31 2022 365
4 2022-04-30 30 2022 365
You can use pd.Series.dt.is_leap_year for that:
import numpy as np
import pandas as pd
df = pd.DataFrame({'date': pd.date_range('1999-01-01', '2022-01-01', freq='1y')})
df['total_days'] = np.where(df['date'].dt.is_leap_year, 366, 365)
print(df)
date total_days
0 1999-12-31 365
1 2000-12-31 366
2 2001-12-31 365
3 2002-12-31 365
4 2003-12-31 365
5 2004-12-31 366
6 2005-12-31 365
7 2006-12-31 365
8 2007-12-31 365
9 2008-12-31 366
10 2009-12-31 365
11 2010-12-31 365
12 2011-12-31 365
13 2012-12-31 366
14 2013-12-31 365
15 2014-12-31 365
16 2015-12-31 365
17 2016-12-31 366
18 2017-12-31 365
19 2018-12-31 365
20 2019-12-31 365
21 2020-12-31 366
22 2021-12-31 365
Caveat: this only works where the Gregorian calendar is valid for the years in question, e.g. not in years when a country switched from the Julian to the Gregorian calendar (1752 in the UK, 1918 in Russia, ...).
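Applied to the frame from the question itself (a sketch reusing the question's own setup code; column names follow that code):

```python
import numpy as np
import pandas as pd

df_date = pd.date_range(start='12-31-2021', end='12-31-2026', freq='M')
df_date1 = pd.DataFrame(df_date)
df_date1['daysinmonth'] = df_date1[0].dt.daysinmonth
df_date1['year'] = df_date1[0].dt.year
# 366 for leap years, 365 otherwise
df_date1['Totaldays'] = np.where(df_date1[0].dt.is_leap_year, 366, 365)
print(df_date1.head())
```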

Monthly aggregated values, pandas dataframe

A sample CSV data in which the first column is a time stamp (date + time):
2018-01-01 10:00:00,23,43
2018-01-02 11:00:00,34,35
2018-01-05 12:00:00,25,4
2018-01-10 15:00:00,22,96
2018-01-01 18:00:00,24,53
2018-03-01 10:00:00,94,98
2018-04-20 10:00:00,90,9
2018-04-10 10:00:00,45,51
2018-01-01 10:00:00,74,44
2018-12-01 10:00:00,76,87
2018-11-01 10:00:00,76,87
2018-12-12 10:00:00,87,90
I already wrote some code that does the monthly aggregation while waiting for suggestions.
Thanks @moys, anyway!
import pandas as pd

df1 = pd.read_csv('Sample.txt', header=None, names=['Timestamp', 'Value 1', 'Value 2'])
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df1['Monthly'] = df1['Timestamp'].dt.to_period('M')
grouper = pd.Grouper(key='Monthly')
df2 = df1.groupby(grouper)[['Value 1', 'Value 2']].sum().reset_index()
The output is:
Monthly Value 1 Value 2
0 2018-01 202 275
1 2018-03 94 98
2 2018-04 135 60
3 2018-12 163 177
4 2018-11 76 87
What if a dataset has more columns? How do I modify my code so that it automatically works on a dataset with more columns?
2018-02-01 10:00:00,23,43,32
2018-02-02 11:00:00,34,35,43
2018-03-05 12:00:00,25,4,43
2018-02-10 15:00:00,22,96,24
2018-05-01 18:00:00,24,53,98
2018-02-01 10:00:00,94,98,32
2018-02-20 10:00:00,90,9,24
2018-07-10 10:00:00,45,51,32
2018-01-01 10:00:00,74,44,34
2018-12-04 10:00:00,76,87,53
2018-12-02 10:00:00,76,87,21
2018-12-12 10:00:00,87,90,98
You can do something like below
df.groupby(pd.to_datetime(df['date']).dt.month).sum().reset_index()
Output (here the 'date' column is the month number):
date val1 val2
0 1 202 275
1 3 94 98
2 4 135 60
3 11 76 87
4 12 163 177
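To handle an arbitrary number of value columns, one option (a sketch, not from the original answer) is to skip the explicit column list and let sum() aggregate every numeric column; numeric_only=True keeps the timestamp column out of the sum:

```python
import io
import pandas as pd

csv = """2018-02-01 10:00:00,23,43,32
2018-02-02 11:00:00,34,35,43
2018-03-05 12:00:00,25,4,43"""

df = pd.read_csv(io.StringIO(csv), header=None)
df[0] = pd.to_datetime(df[0])
# no explicit column list: every numeric column is summed automatically
out = df.groupby(df[0].dt.to_period('M')).sum(numeric_only=True).reset_index()
print(out)
```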

Customised start and end date of the month

I have a data frame which contains date and value. I have to compute sum of the values for each month.
i.e., df.groupby(pd.Grouper(freq='M'))['Value'].sum()
But the problem is that in my data set a month starts on the 21st and ends on the 20th. Is there any way to tell pandas to group a month from the 21st to the 20th?
Assume my data frame contains starting and ending date is,
starting_date=datetime.datetime(2015,11,21)
ending_date=datetime.datetime(2017,11,20)
So far I tried:
starting_date = df['Date'].min()
ending_date = df['Date'].max()
month_wise_sum = []
while starting_date <= ending_date:
    temp = starting_date + datetime.timedelta(days=31)
    e_y = temp.year
    e_m = temp.month
    e_d = 20
    temp = datetime.datetime(e_y, e_m, e_d)
    month_wise_sum.append(df[df['Date'].between(starting_date, temp)]['Value'].sum())
    starting_date = temp + datetime.timedelta(days=1)
print(month_wise_sum)
My code above does the job, but I am still waiting for a more Pythonic way to achieve it.
My biggest problem is slicing the data frame month-wise, for example:
2015-11-21 to 2015-12-20
Is there any Pythonic way to achieve this?
Thanks in advance.
For example, consider this as my dataframe. It contains dates from date_range(datetime.datetime(2017, 1, 21), datetime.datetime(2017, 10, 20)).
Input:
Date Value
0 2017-01-21 -1.055784
1 2017-01-22 1.643813
2 2017-01-23 -0.865919
3 2017-01-24 -0.126777
4 2017-01-25 -0.530914
5 2017-01-26 0.579418
6 2017-01-27 0.247825
7 2017-01-28 -0.951166
8 2017-01-29 0.063764
9 2017-01-30 -1.960660
10 2017-01-31 1.118236
11 2017-02-01 -0.622514
12 2017-02-02 -1.416240
13 2017-02-03 1.025384
14 2017-02-04 0.448695
15 2017-02-05 1.642983
16 2017-02-06 -1.386413
17 2017-02-07 0.774173
18 2017-02-08 -1.690147
19 2017-02-09 -1.759029
20 2017-02-10 0.345326
21 2017-02-11 0.549472
22 2017-02-12 0.814701
23 2017-02-13 0.983923
24 2017-02-14 0.551617
25 2017-02-15 0.001959
26 2017-02-16 -0.537112
27 2017-02-17 1.251595
28 2017-02-18 1.448950
29 2017-02-19 -0.452310
.. ... ...
243 2017-09-21 0.791439
244 2017-09-22 1.368647
245 2017-09-23 0.504924
246 2017-09-24 0.214994
247 2017-09-25 -3.020875
248 2017-09-26 -0.440378
249 2017-09-27 1.324862
250 2017-09-28 0.116897
251 2017-09-29 -0.114449
252 2017-09-30 -0.879000
253 2017-10-01 0.088985
254 2017-10-02 -0.849833
255 2017-10-03 1.136802
256 2017-10-04 -0.398931
257 2017-10-05 0.067660
258 2017-10-06 1.080505
259 2017-10-07 0.516830
260 2017-10-08 -0.755461
261 2017-10-09 1.367292
262 2017-10-10 1.444083
263 2017-10-11 -0.840497
264 2017-10-12 -0.090092
265 2017-10-13 0.193068
266 2017-10-14 -0.284673
267 2017-10-15 -1.128397
268 2017-10-16 1.029995
269 2017-10-17 -1.269262
270 2017-10-18 0.320187
271 2017-10-19 0.580825
272 2017-10-20 1.001110
[273 rows x 2 columns]
I want to slice this dataframe like below
Iter-1:
Date Value
0 2017-01-21 -1.055784
1 2017-01-22 1.643813
2 2017-01-23 -0.865919
3 2017-01-24 -0.126777
4 2017-01-25 -0.530914
5 2017-01-26 0.579418
6 2017-01-27 0.247825
7 2017-01-28 -0.951166
8 2017-01-29 0.063764
9 2017-01-30 -1.960660
10 2017-01-31 1.118236
11 2017-02-01 -0.622514
12 2017-02-02 -1.416240
13 2017-02-03 1.025384
14 2017-02-04 0.448695
15 2017-02-05 1.642983
16 2017-02-06 -1.386413
17 2017-02-07 0.774173
18 2017-02-08 -1.690147
19 2017-02-09 -1.759029
20 2017-02-10 0.345326
21 2017-02-11 0.549472
22 2017-02-12 0.814701
23 2017-02-13 0.983923
24 2017-02-14 0.551617
25 2017-02-15 0.001959
26 2017-02-16 -0.537112
27 2017-02-17 1.251595
28 2017-02-18 1.448950
29 2017-02-19 -0.452310
30 2017-02-20 0.616847
iter-2:
Date Value
31 2017-02-21 2.356993
32 2017-02-22 -0.265603
33 2017-02-23 -0.651336
34 2017-02-24 -0.952791
35 2017-02-25 0.124278
36 2017-02-26 0.545956
37 2017-02-27 0.671670
38 2017-02-28 -0.836518
39 2017-03-01 1.178424
40 2017-03-02 0.182758
41 2017-03-03 -0.733987
42 2017-03-04 0.112974
43 2017-03-05 -0.357269
44 2017-03-06 1.454310
45 2017-03-07 -1.201187
46 2017-03-08 0.212540
47 2017-03-09 0.082771
48 2017-03-10 -0.906591
49 2017-03-11 -0.931166
50 2017-03-12 -0.391388
51 2017-03-13 -0.893409
52 2017-03-14 -1.852290
53 2017-03-15 0.368390
54 2017-03-16 -1.672943
55 2017-03-17 -0.934288
56 2017-03-18 -0.154785
57 2017-03-19 0.552378
58 2017-03-20 0.096006
.
.
.
iter-n:
Date Value
243 2017-09-21 0.791439
244 2017-09-22 1.368647
245 2017-09-23 0.504924
246 2017-09-24 0.214994
247 2017-09-25 -3.020875
248 2017-09-26 -0.440378
249 2017-09-27 1.324862
250 2017-09-28 0.116897
251 2017-09-29 -0.114449
252 2017-09-30 -0.879000
253 2017-10-01 0.088985
254 2017-10-02 -0.849833
255 2017-10-03 1.136802
256 2017-10-04 -0.398931
257 2017-10-05 0.067660
258 2017-10-06 1.080505
259 2017-10-07 0.516830
260 2017-10-08 -0.755461
261 2017-10-09 1.367292
262 2017-10-10 1.444083
263 2017-10-11 -0.840497
264 2017-10-12 -0.090092
265 2017-10-13 0.193068
266 2017-10-14 -0.284673
267 2017-10-15 -1.128397
268 2017-10-16 1.029995
269 2017-10-17 -1.269262
270 2017-10-18 0.320187
271 2017-10-19 0.580825
272 2017-10-20 1.001110
So that I can calculate each month's sum of the Value series:
[0.7536957367200978, -4.796100620186059, -1.8423374363366014, 2.3780759926221267, 5.753755441349653, -0.01072884830461407, -0.24877912707664018, 11.666305431020149, 3.0772592888909065]
I hope I explained it thoroughly.
For the purpose of testing my solution, I generated some random data; the frequency is daily, but it should work for any frequency.
import numpy as np
import pandas as pd

index = pd.date_range('2015-11-21', '2017-11-20')
df = pd.DataFrame(index=index, data={0: np.random.rand(len(index))})
Here I passed an array of datetimes as the index. Indexing with dates unlocks a lot of added functionality in pandas. With your data you should do (if the Date column already contains only datetime values):
df = df.set_index('Date')
Then I would realign your data artificially by subtracting 20 days from the index:
from datetime import timedelta
df.index -= timedelta(days=20)
and then I would resample the data to a monthly index, summing all data within the same month:
df.resample('M').sum()
The resulting dataframe is indexed by the last datetime of each month (for me, something like):
0
2015-11-30 3.191098
2015-12-31 16.066213
2016-01-31 16.315388
2016-02-29 13.507774
2016-03-31 15.939567
2016-04-30 17.094247
2016-05-31 15.274829
2016-06-30 13.609203
but feel free to reindex it :)
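If you'd rather have the index show the real end of each custom period (the 20th of the following month) instead of the artificial month ends, one way (a sketch building on this answer) is to shift the resampled labels forward by the same 20 days:

```python
from datetime import timedelta

import numpy as np
import pandas as pd

index = pd.date_range('2015-11-21', '2017-11-20')
df = pd.DataFrame(index=index, data={0: np.random.rand(len(index))})

df.index -= timedelta(days=20)       # realign so each 21st-to-20th period falls in one calendar month
monthly = df.resample('M').sum()
monthly.index += timedelta(days=20)  # month-end + 20 days = the 20th of the following month
print(monthly.head())
```

Since every resampled label is a month end, adding 20 days always lands on the 20th of the next month, whatever the month's length.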
Using pandas.cut() could be a quick solution for you:
import pandas as pd
import numpy as np

start_date = "2015-11-21"
# As @ALollz mentioned, the month containing the original end_date='2017-11-20'
# was missing: pd.date_range() only generates dates inside the specified range,
# so '2017-11-30' (using freq='M') exceeds the original end='2017-11-20' and is
# cut off. The same can happen to the start month with freq='MS'. An easy fix
# is to extend end_date into the next month (or use the month's own end date
# '2017-11-30', or replace end= with periods=25).
end_date = "2017-12-20"

# create a testing dataframe
df = pd.DataFrame({"date": pd.date_range(start_date, periods=710, freq='D'),
                   "value": np.random.randn(710)})
# set up bin edges on the 20th of each month so they cover all dates
bins = [d.replace(day=20) for d in pd.date_range(start_date, end_date, freq="M")]
# group and summarize using the ranges defined by the bins
df.groupby(pd.cut(df.date, bins))[['value']].sum()
value
date
(2015-11-20, 2015-12-20] -5.222231
(2015-12-20, 2016-01-20] -4.957852
(2016-01-20, 2016-02-20] -0.019802
(2016-02-20, 2016-03-20] -0.304897
(2016-03-20, 2016-04-20] -7.605129
(2016-04-20, 2016-05-20] 7.317627
(2016-05-20, 2016-06-20] 10.916529
(2016-06-20, 2016-07-20] 1.834234
(2016-07-20, 2016-08-20] -3.324972
(2016-08-20, 2016-09-20] 7.243810
(2016-09-20, 2016-10-20] 2.745925
(2016-10-20, 2016-11-20] 8.929903
(2016-11-20, 2016-12-20] -2.450010
(2016-12-20, 2017-01-20] 3.137994
(2017-01-20, 2017-02-20] -0.796587
(2017-02-20, 2017-03-20] -4.368718
(2017-03-20, 2017-04-20] -9.896459
(2017-04-20, 2017-05-20] 2.350651
(2017-05-20, 2017-06-20] -2.667632
(2017-06-20, 2017-07-20] -2.319789
(2017-07-20, 2017-08-20] -9.577919
(2017-08-20, 2017-09-20] 2.962070
(2017-09-20, 2017-10-20] -2.901864
(2017-10-20, 2017-11-20] 2.873909
# export the result
summary = df.groupby(pd.cut(df.date, bins)).value.sum().tolist()

How to calculate day's difference between successive pandas dataframe rows with condition

I have a pandas dataframe like following..
item_id date
101 2016-01-05
101 2016-01-21
121 2016-01-08
121 2016-01-22
128 2016-01-19
128 2016-02-17
131 2016-01-11
131 2016-01-23
131 2016-01-24
131 2016-02-06
131 2016-02-07
I want to calculate the day difference between dates in the date column, but with respect to the item_id column. First I want to sort the dataframe by date, grouping on item_id. It should look like this:
item_id date
101 2016-01-05
101 2016-01-08
121 2016-01-21
121 2016-01-22
128 2016-01-17
128 2016-02-19
131 2016-01-11
131 2016-01-23
131 2016-01-24
131 2016-02-06
131 2016-02-07
Then I want to calculate the difference between dates, again grouping on item_id, so the output should look like the following:
item_id date day_difference
101 2016-01-05 0
101 2016-01-08 3
121 2016-01-21 0
121 2016-01-22 1
128 2016-01-17 0
128 2016-02-19 2
131 2016-01-11 0
131 2016-01-23 12
131 2016-01-24 1
131 2016-02-06 13
131 2016-02-07 1
For sorting I used something like this:
df.groupby('item_id').apply(lambda x: new_df.sort('date'))
but it didn't work out. I am able to calculate the difference between consecutive rows with:
(df['date'] - df['date'].shift(1))
but not grouped by item_id.
I think you can use:
import numpy as np

df = df.sort_values(['item_id', 'date'])
df['diff'] = df.groupby('item_id')['date'].diff() / np.timedelta64(1, 'D')
df['diff'] = df['diff'].fillna(0)
print(df)
item_id date diff
0 101 2016-01-05 0
1 101 2016-01-21 16
2 121 2016-01-08 0
3 121 2016-01-22 14
4 128 2016-01-19 0
5 128 2016-02-17 29
6 131 2016-01-11 0
7 131 2016-01-23 12
8 131 2016-01-24 1
9 131 2016-02-06 13
10 131 2016-02-07 1
You can also try:
df.date.diff().fillna(pd.Timedelta(seconds=0))
Note: .fillna(0) is no longer supported for timedelta dtypes; use a pd.Timedelta fill value as above.
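Putting the pieces together, a minimal runnable sketch of the whole recipe (sort once, per-group diff, timedelta-aware fillna) might look like:

```python
import pandas as pd

df = pd.DataFrame({
    'item_id': [101, 101, 121, 121, 128, 128],
    'date': pd.to_datetime(['2016-01-21', '2016-01-05', '2016-01-08',
                            '2016-01-22', '2016-01-19', '2016-02-17']),
})

df = df.sort_values(['item_id', 'date'])
# per-group diff; the first row of each group is NaT, filled with a zero timedelta
df['day_difference'] = df.groupby('item_id')['date'].diff().fillna(pd.Timedelta(0)).dt.days
print(df)
```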
