I have a df:
ID date
1 05-01
2 04-08
3 06-08
4 03-07
...
and a date range from 01-01-2013 to 12-31-2013: pd.date_range(start='1/1/2013', end='31/12/2013')
For each date in the df, I want the difference in days between that date and each date in the date range. For example, I want a df like this:
05-01 - 01-01 04-08 - 01-01
05-01 - 01-02 04-08 - 01-02
05-01 - 01-03 04-08 - 01-03
.. ..
05-01 - 12-31 04-08 - 12-31
and so on for each date..
Thanks
We first need to convert the date column to datetime. Then we can perform the subtraction.
Data:
import pandas as pd
df = pd.DataFrame({
'ID': [1, 2, 3, 4],
'date': ['05-01', '04-08', '06-08', '03-07']
})
If date is in the format %d-%m, we can convert it with to_datetime by appending the year and setting dayfirst=True. Then we can broadcast the subtraction and create a new dataframe from the results:
# append year and convert to datetime64[ns]
df['date'] = pd.to_datetime(df['date'] + '-2013', dayfirst=True)
# Build the date range
dr = pd.date_range(start='1/1/2013', end='31/12/2013')
# Create new DataFrame from the broadcasted subtraction.
new_df = pd.DataFrame(
df['date'].values - dr.values[:, None],
columns=df['date'].rename(None),
index=dr
)
df:
ID date
0 1 2013-01-05
1 2 2013-08-04
2 3 2013-08-06
3 4 2013-07-03
new_df (day first):
2013-01-05 2013-08-04 2013-08-06 2013-07-03
2013-01-01 4 days 215 days 217 days 183 days
2013-01-02 3 days 214 days 216 days 182 days
2013-01-03 2 days 213 days 215 days 181 days
2013-01-04 1 days 212 days 214 days 180 days
2013-01-05 0 days 211 days 213 days 179 days
... ... ... ... ...
2013-12-27 -356 days -145 days -143 days -177 days
2013-12-28 -357 days -146 days -144 days -178 days
2013-12-29 -358 days -147 days -145 days -179 days
2013-12-30 -359 days -148 days -146 days -180 days
2013-12-31 -360 days -149 days -147 days -181 days
[365 rows x 4 columns]
If date is in format %m-%d we can prepend the year. The subtraction is exactly the same:
# Prepend year and convert to datetime64[ns]
df['date'] = pd.to_datetime('2013-' + df['date'])
# Build the date range
dr = pd.date_range(start='1/1/2013', end='31/12/2013')
# Create new DataFrame from the broadcasted subtraction.
new_df = pd.DataFrame(
df['date'].values - dr.values[:, None],
columns=df['date'].rename(None),
index=dr
)
df:
ID date
0 1 2013-05-01
1 2 2013-04-08
2 3 2013-06-08
3 4 2013-03-07
new_df (month first):
2013-05-01 2013-04-08 2013-06-08 2013-03-07
2013-01-01 120 days 97 days 158 days 65 days
2013-01-02 119 days 96 days 157 days 64 days
2013-01-03 118 days 95 days 156 days 63 days
2013-01-04 117 days 94 days 155 days 62 days
2013-01-05 116 days 93 days 154 days 61 days
... ... ... ... ...
2013-12-27 -240 days -263 days -202 days -295 days
2013-12-28 -241 days -264 days -203 days -296 days
2013-12-29 -242 days -265 days -204 days -297 days
2013-12-30 -243 days -266 days -205 days -298 days
2013-12-31 -244 days -267 days -206 days -299 days
[365 rows x 4 columns]
If only the number of days is wanted, we can floor-divide by 1 day (floor division is safe here because these are all dates with no time component, so every difference is a whole number of days):
new_df = pd.DataFrame(
(df['date'].values - dr.values[:, None]) // pd.Timedelta('1D'),
columns=df['date'].rename(None),
index=dr
)
new_df:
2013-01-05 2013-08-04 2013-08-06 2013-07-03
2013-01-01 4 215 217 183
2013-01-02 3 214 216 182
2013-01-03 2 213 215 181
2013-01-04 1 212 214 180
2013-01-05 0 211 213 179
... ... ... ... ...
2013-12-27 -356 -145 -143 -177
2013-12-28 -357 -146 -144 -178
2013-12-29 -358 -147 -145 -179
2013-12-30 -359 -148 -146 -180
2013-12-31 -360 -149 -147 -181
Another way:
df["date"] = pd.to_datetime(df["date"]+"-2013", format="%m-%d-%Y")
dates = pd.date_range(start="1/1/2013", end="31/12/2013")
data = df["date"].apply(lambda x: [(x-dt).days for dt in dates]).tolist()
new_df = pd.DataFrame(data=data, index=df["date"], columns=dates).transpose()
>>> new_df
date 2013-05-01 2013-04-08 2013-06-08 2013-03-07
2013-01-01 120 97 158 65
2013-01-02 119 96 157 64
2013-01-03 118 95 156 63
2013-01-04 117 94 155 62
2013-01-05 116 93 154 61
... ... ... ...
2013-12-27 -240 -263 -202 -295
2013-12-28 -241 -264 -203 -296
2013-12-29 -242 -265 -204 -297
2013-12-30 -243 -266 -205 -298
2013-12-31 -244 -267 -206 -299
Each value in the DataFrame shows the difference in days between the column header and the row index.
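To make the broadcasting step in the first answer easier to follow, here is a minimal sketch on a hypothetical two-date sample and a three-day range (both chosen only for illustration). It shows how a (2,) array and a (3, 1) array combine into a (3, 2) result:

```python
import numpy as np
import pandas as pd

# Hypothetical two-date sample; the year 2013 follows the question.
dates = pd.to_datetime(["2013-05-01", "2013-04-08"])
dr = pd.date_range(start="2013-01-01", end="2013-01-03")

# dates.values has shape (2,); dr.values[:, None] has shape (3, 1).
# NumPy broadcasts the pair to a (3, 2) array of timedelta64 values.
diff = dates.values - dr.values[:, None]

# Floor-divide by one day to get plain integer day counts.
days = diff // np.timedelta64(1, "D")

print(diff.shape)        # (3, 2)
print(days[0].tolist())  # [120, 97]
```

Each row corresponds to one date in the range and each column to one date from the df, which matches the layout of new_df above.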
I have a dataset and I want to group the days by month. Example of my dataset:
Date Price
2020-01-02 23245
2020-01-03 23245
2020-01-04 23245
2020-01-05 23245
I want this:
Date Price
2020-01 252525
2020-02 4525224
2020-03 2424552
2020-04 4552525
So, I want to sum by month and drop the day.
Make sure df has correct types first, then you can group by year, month:
df["Date"] = pd.to_datetime(df.Date)
df["Price"] = df.Price.astype(float)
df["year"] = df.Date.dt.year
df["month"] = df.Date.dt.month
df.groupby(["year", "month"], as_index=False)["Price"].sum()
output:
year month Price
0 2020 1 92980
Let's assume you have a dataframe with one record per day, where the price is a random number between 1 and 100. You can then group by month and sum the price.
import pandas as pd
import random
df = pd.DataFrame({'Date':pd.date_range(start='2020-01-01', end='2020-12-30', freq='D'),
'Price':[random.randint(1,100) for _ in range(365)]})
df['Month'] = df.Date.dt.strftime('%Y-%m')
print(df)
print(df.groupby('Month')['Price'].sum().reset_index())
Here's the output:
Date Price Month
0 2020-01-01 13 2020-01
1 2020-01-02 40 2020-01
2 2020-01-03 61 2020-01
3 2020-01-04 86 2020-01
4 2020-01-05 100 2020-01
.. ... ... ...
360 2020-12-26 80 2020-12
361 2020-12-27 82 2020-12
362 2020-12-28 13 2020-12
363 2020-12-29 10 2020-12
364 2020-12-30 58 2020-12
[365 rows x 3 columns]
Month Price
0 2020-01 1622
1 2020-02 1244
2 2020-03 1564
3 2020-04 1335
4 2020-05 1625
5 2020-06 1545
6 2020-07 1406
7 2020-08 1891
8 2020-09 1625
9 2020-10 1625
10 2020-11 1309
11 2020-12 1327
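As a possible alternative to adding helper columns, the grouping can be done directly on a monthly period. This is only a sketch: the four January rows come from the question, and the February row is a made-up value added so that two groups appear.

```python
import pandas as pd

# Four January rows from the question plus one hypothetical February row.
df = pd.DataFrame({
    "Date": ["2020-01-02", "2020-01-03", "2020-01-04", "2020-01-05", "2020-02-01"],
    "Price": [23245, 23245, 23245, 23245, 100],
})
df["Date"] = pd.to_datetime(df["Date"])

# Group on a monthly Period so the day component is dropped.
monthly = df.groupby(df["Date"].dt.to_period("M"))["Price"].sum()
print(monthly)
```

The result is indexed by periods such as 2020-01, which is close to the layout asked for in the question.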
Datos
2015-01-01 58
2015-01-02 42
2015-01-03 41
2015-01-04 13
2015-01-05 6
... ...
2020-06-18 49
2020-06-19 41
2020-06-20 23
2020-06-21 39
2020-06-22 22
2000 rows × 1 columns
I have this df with a single column holding the average temperature of each day over an interval of years. I would like to know how to get the maximum for each day of the year (assuming the year has 365 days) and obtain a df similar to this:
Datos
1 40
2 50
3 46
4 8
5 26
... ...
361 39
362 23
363 23
364 37
365 25
365 rows × 1 columns
Forgive my ignorance and thank you very much for the help.
You can do this:
df['Date'] = pd.to_datetime(df['Date'])
df = df.groupby(by=pd.Grouper(key='Date', freq='D')).max().reset_index()
df['Day'] = df['Date'].dt.dayofyear
print(df)
Date Temp Day
0 2015-01-01 58.0 1
1 2015-01-02 42.0 2
2 2015-01-03 41.0 3
3 2015-01-04 13.0 4
4 2015-01-05 6.0 5
... ... ... ...
1995 2020-06-18 49.0 170
1996 2020-06-19 41.0 171
1997 2020-06-20 23.0 172
1998 2020-06-21 39.0 173
1999 2020-06-22 22.0 174
Make a new column (the dates are in the index, so take the day of year from there):
df["day of year"] = pd.to_datetime(df.index).dayofyear
Then
df.groupby("day of year").max()
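A small runnable sketch of the day-of-year grouping, assuming the dates live in a DatetimeIndex as the question's printout suggests. The four values below are hypothetical and cover the same two calendar days in two different years:

```python
import pandas as pd

# Hypothetical data: the same two calendar days in two different years.
df = pd.DataFrame(
    {"Datos": [58, 42, 40, 50]},
    index=pd.to_datetime(["2015-01-01", "2015-01-02", "2016-01-01", "2016-01-02"]),
)

# Group by day-of-year taken from the DatetimeIndex, then take the maximum.
daily_max = df.groupby(df.index.dayofyear)["Datos"].max()
print(daily_max)
```

Note that dayofyear runs to 366 in leap years, so days after 28 February are shifted by one in those years. If that matters, grouping by month and day instead is one way around it.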
Week Year new
0 43 2016 2016-10-24
1 44 2016 2016-10-31
2 51 2016 2016-12-19
3 2 2017 2017-01-09
4 5 2017 2017-01-30
5 12 2017 2017-03-20
6 52 2018 2018-12-24
7 53 2018 2018-12-31
8 1 2019 2018-12-31
9 2 2019 2019-01-07
10 5 2019 2019-01-28
11 52 2019 2019-12-23
How can I add a 0 in front of the week if its length is 1? I need to merge Year and Week together, as in 201702.
Try this
df["Week"] = df.Week.astype('str').str.zfill(2)
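To get the merged value the question asks for, the padded week can be concatenated with the year. A minimal sketch follows; the column name YearWeek is hypothetical and the sample rows are taken from the table above:

```python
import pandas as pd

df = pd.DataFrame({"Week": [2, 5, 12, 43], "Year": [2017, 2017, 2017, 2016]})

# Zero-pad the week to two characters, then prepend the year.
df["YearWeek"] = df["Year"].astype(str) + df["Week"].astype(str).str.zfill(2)
print(df["YearWeek"].tolist())  # ['201702', '201705', '201712', '201643']
```

The result is a string column. Convert it with astype(int) if a numeric key is needed.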
I have a dataframe that provides two integer columns with the Year and Week of the year:
import pandas as pd
import numpy as np
L1 = [43,44,51,2,5,12]
L2 = [2016,2016,2016,2017,2017,2017]
df = pd.DataFrame({"Week":L1,"Year":L2})
df
Out[72]:
Week Year
0 43 2016
1 44 2016
2 51 2016
3 2 2017
4 5 2017
5 12 2017
I need to create a datetime-object from these two numbers.
I tried this, but it throws an error:
df["DT"] = df.apply(lambda x: np.datetime64(x.Year,'Y') + np.timedelta64(x.Week,'W'),axis=1)
Then I tried this. It runs but gives the wrong result: it ignores the week completely:
df["S"] = df.Week.astype(str)+'-'+df.Year.astype(str)
df["DT"] = df["S"].apply(lambda x: pd.to_datetime(x,format='%W-%Y'))
df
Out[74]:
Week Year S DT
0 43 2016 43-2016 2016-01-01
1 44 2016 44-2016 2016-01-01
2 51 2016 51-2016 2016-01-01
3 2 2017 2-2017 2017-01-01
4 5 2017 5-2017 2017-01-01
5 12 2017 12-2017 2017-01-01
I'm really getting lost between Python's datetime, NumPy's datetime64, and pandas Timestamp. Can you tell me how it's done correctly?
I'm using Python 3, if that is relevant in any way.
EDIT:
Starting with Python 3.8 the problem is easily solved with a newly introduced method on datetime.date objects: https://docs.python.org/3/library/datetime.html#datetime.date.fromisocalendar
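A minimal sketch of how the method mentioned in the edit could be applied to the example frame, assuming Python 3.8 or later and assuming the Monday of each week is the wanted date (the weekday argument 1 is my choice, not something stated in the question):

```python
import datetime
import pandas as pd

df = pd.DataFrame({"Week": [43, 44, 51, 2, 5, 12],
                   "Year": [2016, 2016, 2016, 2017, 2017, 2017]})

# ISO weekday 1 is Monday, so this returns the Monday of each ISO week.
df["DT"] = pd.to_datetime([
    datetime.date.fromisocalendar(int(y), int(w), 1)
    for y, w in zip(df["Year"], df["Week"])
])
print(df)
```

The int() calls convert NumPy integers to plain Python integers before they are passed to fromisocalendar.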
Try this:
In [19]: pd.to_datetime(df.Year.astype(str), format='%Y') + \
pd.to_timedelta(df.Week.mul(7).astype(str) + ' days')
Out[19]:
0 2016-10-28
1 2016-11-04
2 2016-12-23
3 2017-01-15
4 2017-02-05
5 2017-03-26
dtype: datetime64[ns]
Initially I have timestamps in s
It's much easier to parse it from a UNIX epoch timestamp:
df['Date'] = pd.to_datetime(df['UNIX_Time'], unit='s')
Timing for 10M rows DF:
Setup:
In [26]: df = pd.DataFrame(pd.date_range('1970-01-01', freq='1min', periods=10**7), columns=['date'])
In [27]: df.shape
Out[27]: (10000000, 1)
In [28]: df['unix_ts'] = df['date'].astype(np.int64)//10**9
In [30]: df
Out[30]:
date unix_ts
0 1970-01-01 00:00:00 0
1 1970-01-01 00:01:00 60
2 1970-01-01 00:02:00 120
3 1970-01-01 00:03:00 180
4 1970-01-01 00:04:00 240
5 1970-01-01 00:05:00 300
6 1970-01-01 00:06:00 360
7 1970-01-01 00:07:00 420
8 1970-01-01 00:08:00 480
9 1970-01-01 00:09:00 540
... ... ...
9999990 1989-01-05 10:30:00 599999400
9999991 1989-01-05 10:31:00 599999460
9999992 1989-01-05 10:32:00 599999520
9999993 1989-01-05 10:33:00 599999580
9999994 1989-01-05 10:34:00 599999640
9999995 1989-01-05 10:35:00 599999700
9999996 1989-01-05 10:36:00 599999760
9999997 1989-01-05 10:37:00 599999820
9999998 1989-01-05 10:38:00 599999880
9999999 1989-01-05 10:39:00 599999940
[10000000 rows x 2 columns]
Check:
In [31]: pd.to_datetime(df.unix_ts, unit='s')
Out[31]:
0 1970-01-01 00:00:00
1 1970-01-01 00:01:00
2 1970-01-01 00:02:00
3 1970-01-01 00:03:00
4 1970-01-01 00:04:00
5 1970-01-01 00:05:00
6 1970-01-01 00:06:00
7 1970-01-01 00:07:00
8 1970-01-01 00:08:00
9 1970-01-01 00:09:00
...
9999990 1989-01-05 10:30:00
9999991 1989-01-05 10:31:00
9999992 1989-01-05 10:32:00
9999993 1989-01-05 10:33:00
9999994 1989-01-05 10:34:00
9999995 1989-01-05 10:35:00
9999996 1989-01-05 10:36:00
9999997 1989-01-05 10:37:00
9999998 1989-01-05 10:38:00
9999999 1989-01-05 10:39:00
Name: unix_ts, Length: 10000000, dtype: datetime64[ns]
Timing:
In [32]: %timeit pd.to_datetime(df.unix_ts, unit='s')
10 loops, best of 3: 156 ms per loop
Conclusion: I think 156 milliseconds for converting 10,000,000 rows is not that slow.
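For a quick check without building ten million rows, the same conversion can be tried on a handful of values. This is a small sketch; the three epoch values are picked from the printout above and the column name UNIX_Time follows the answer:

```python
import pandas as pd

# A few epoch-second values taken from the session output above.
df = pd.DataFrame({"UNIX_Time": [0, 60, 599999940]})

# unit='s' tells pandas the integers are seconds since 1970-01-01.
df["Date"] = pd.to_datetime(df["UNIX_Time"], unit="s")
print(df)
```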
As @Gianmario Spacagna mentioned, for later years such as 2018 use %V with %G:
L1 = [43,44,51,2,5,12,52,53,1,2,5,52]
L2 = [2016,2016,2016,2017,2017,2017,2018,2018,2019,2019,2019,2019]
df = pd.DataFrame({"Week":L1,"Year":L2})
df['new'] = pd.to_datetime(df.Week.astype(str)+
df.Year.astype(str).add('-1') ,format='%V%G-%u')
print (df)
Week Year new
0 43 2016 2016-10-24
1 44 2016 2016-10-31
2 51 2016 2016-12-19
3 2 2017 2017-01-09
4 5 2017 2017-01-30
5 12 2017 2017-03-20
6 52 2018 2018-12-24
7 53 2018 2018-12-31
8 1 2019 2018-12-31
9 2 2019 2019-01-07
10 5 2019 2019-01-28
11 52 2019 2019-12-23
There is something fishy going on with weeks starting from 2019. The ISO-8601 standard assigns the 31st December 2018 to the week 1 of year 2019. The other approaches based on:
pd.to_datetime(df.Week.astype(str)+
df.Year.astype(str).add('-2') ,format='%W%Y-%w')
will give shifted results starting from 2019.
In order to be compliant with the ISO-8601 standard you would have to do the following:
import pandas as pd
import datetime
L1 = [52,53,1,2,5,52]
L2 = [2018,2018,2019,2019,2019,2019]
df = pd.DataFrame({"Week":L1,"Year":L2})
df['ISO'] = df['Year'].astype(str) + '-W' + df['Week'].astype(str) + '-1'
df['DT'] = df['ISO'].map(lambda x: datetime.datetime.strptime(x, "%G-W%V-%u"))
print(df)
It prints:
Week Year ISO DT
0 52 2018 2018-W52-1 2018-12-24
1 53 2018 2018-W53-1 2018-12-31
2 1 2019 2019-W1-1 2018-12-31
3 2 2019 2019-W2-1 2019-01-07
4 5 2019 2019-W5-1 2019-01-28
5 52 2019 2019-W52-1 2019-12-23
Week 53 of 2018 is ignored and mapped to week 1 of 2019.
You can verify this yourself at https://www.epochconverter.com/weeks/2019.
If you want to follow the ISO week date definition:
Weeks start with Monday. Each week's year is the Gregorian year in
which the Thursday falls. The first week of the year, hence, always
contains 4 January. ISO week year numbering therefore slightly
deviates from the Gregorian for some days close to 1 January.
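The rule quoted above can be checked with the standard library. The sketch below tests a range of years chosen arbitrarily for illustration and also shows the 31 December 2018 case discussed earlier:

```python
import datetime

# Collect the ISO week number of 4 January for a range of years.
jan4_weeks = {datetime.date(y, 1, 4).isocalendar()[1] for y in range(2000, 2031)}
print(jan4_weeks)  # {1}

# 31 December 2018 is a Monday and already belongs to ISO week 1 of 2019.
dec31 = tuple(datetime.date(2018, 12, 31).isocalendar())
print(dec31)  # (2019, 1, 1)
```

The tuple is (ISO year, ISO week, ISO weekday), so the ISO year can differ from the calendar year for dates near 1 January.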
The following sample code generates a sequence of 60 dates, starting from Sunday 18 Dec 2016, and adds the appropriate columns.
It adds:
A "Date"
The weekday of the "Date"
The week-starting Monday of that "Date"
The year of that week-starting Monday
A week number (ISO)
The starting Monday date, rebuilt from the year and week number
Sample code below:
import pandas as pd
# Generate Some Dates
dft1 = pd.DataFrame(pd.date_range('2016-12-18', freq='D', periods=60))
dft1.columns = ['e_FullDate']
dft1['e_FullDateWeekDay'] = dft1.e_FullDate.dt.day_name().str.slice(0,3)
#Add a Week Start Date (Monday)
dft1['e_week_start'] = dft1['e_FullDate'] - pd.to_timedelta(dft1['e_FullDate'].dt.weekday,
unit='D')
dft1['e_week_startWeekDay'] = dft1.e_week_start.dt.day_name().str.slice(0,3)
#Add a Week Start Year
dft1['e_week_start_yr'] = dft1.e_week_start.dt.year
#Add a Week Number of Week Start Monday (Series.dt.week was removed in pandas 2.0)
dft1['e_week_no'] = dft1['e_week_start'].dt.isocalendar().week
#Add a Week Start generate from Week Number and Year
dft1['e_week_start_from_week_no'] = pd.to_datetime(dft1.e_week_no.astype(str)+
dft1.e_week_start_yr.astype(str).add('-1') ,format='%W%Y-%w')
dft1['e_week_start_from_week_noWeekDay'] = dft1.e_week_start_from_week_no.dt.day_name().str.slice(0,3)
# display() comes from IPython/Jupyter; use print(dft1) in a plain script
with pd.option_context('display.max_rows', 999, 'display.max_columns', 0, 'display.max_colwidth', 9999):
    display(dft1)