[pd.Series(pd.date_range(row[1].START_DATE, row[1].END_DATE)) for row in df[['START_DATE', 'END_DATE']].iterrows()]
Is there any way to speed up this operation?
Basically, for each given date range I am creating a row for every date in between.
Use DataFrame.itertuples:
L = [pd.Series(pd.date_range(r.START_DATE, r.END_DATE)) for r in df.itertuples()]
Or zip of both columns:
L = [pd.Series(pd.date_range(s, e)) for s, e in zip(df['START_DATE'], df['END_DATE'])]
If you want to join them together:
s = pd.concat(L, ignore_index=True)
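If you also need to know which original row each generated date came from, one possible follow-up is to pass the original index as keys to concat (a sketch; the column names orig_row and date are only illustrative):
# keep the originating row label alongside each generated date
s = pd.concat(L, keys=df.index).reset_index(level=0)
s.columns = ['orig_row', 'date']
print (s.head())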
Performance for 100 rows:
np.random.seed(123)
def random_dates(start, end, n=100):
start_u = start.value//10**9
end_u = end.value//10**9
return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
start = pd.to_datetime('2015-01-01')
end = pd.to_datetime('2018-01-01')
df = pd.DataFrame({'START_DATE': start, 'END_DATE':random_dates(start, end)})
print (df)
In [155]: %timeit [pd.Series(pd.date_range(row[1].START_DATE, row[1].END_DATE)) for row in df[['START_DATE', 'END_DATE']].iterrows()]
33.5 ms ± 145 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [156]: %timeit [pd.date_range(row[1].START_DATE, row[1].END_DATE) for row in df[['START_DATE', 'END_DATE']].iterrows()]
30.3 ms ± 1.91 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [157]: %timeit [pd.Series(pd.date_range(r.START_DATE, r.END_DATE)) for r in df.itertuples()]
25.3 ms ± 218 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [158]: %timeit [pd.Series(pd.date_range(s, e)) for s, e in zip(df['START_DATE'], df['END_DATE'])]
24.3 ms ± 594 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
And for 1000 rows:
start = pd.to_datetime('2015-01-01')
end = pd.to_datetime('2018-01-01')
df = pd.DataFrame({'START_DATE': start, 'END_DATE':random_dates(start, end, n=1000)})
In [159]: %timeit [pd.Series(pd.date_range(row[1].START_DATE, row[1].END_DATE)) for row in df[['START_DATE', 'END_DATE']].iterrows()]
333 ms ± 3.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [160]: %timeit [pd.date_range(row[1].START_DATE, row[1].END_DATE) for row in df[['START_DATE', 'END_DATE']].iterrows()]
314 ms ± 36.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [161]: %timeit [pd.Series(pd.date_range(s, e)) for s, e in zip(df['START_DATE'], df['END_DATE'])]
243 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [162]: %timeit [pd.Series(pd.date_range(r.START_DATE, r.END_DATE)) for r in df.itertuples()]
246 ms ± 2.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Instead of creating a pd.Series on each iteration, do:
[pd.date_range(row[1].START_DATE, row[1].END_DATE)
for row in df[['START_DATE', 'END_DATE']].iterrows()]
And create a dataframe from the result. Here's an example:
df = pd.DataFrame([
    {'start_date': pd.Timestamp(2019, 1, 1), 'end_date': pd.Timestamp(2019, 1, 10)},
    {'start_date': pd.Timestamp(2019, 1, 2), 'end_date': pd.Timestamp(2019, 1, 8)},
    {'start_date': pd.Timestamp(2019, 1, 6), 'end_date': pd.Timestamp(2019, 1, 14)}
])
dr = [pd.date_range(df.loc[i,'start_date'], df.loc[i,'end_date']) for i,_ in df.iterrows()]
pd.DataFrame(dr)
0 1 2 3 4 5 \
0 2019-01-01 2019-01-02 2019-01-03 2019-01-04 2019-01-05 2019-01-06
1 2019-01-02 2019-01-03 2019-01-04 2019-01-05 2019-01-06 2019-01-07
2 2019-01-06 2019-01-07 2019-01-08 2019-01-09 2019-01-10 2019-01-11
6 7 8 9
0 2019-01-07 2019-01-08 2019-01-09 2019-01-10
1 2019-01-08 NaT NaT NaT
2 2019-01-12 2019-01-13 2019-01-14 NaT
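If a single long column of dates is needed rather than one row per range, the wide frame can also be stacked; a small sketch building on the dr list above (stack drops the NaT padding by default):
long_dates = pd.DataFrame(dr).stack().reset_index(drop=True)
print (long_dates.head())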
I am trying to set timezone to a datetime column, based on another column containing the time zone.
Example data:
DATETIME VALUE TIME_ZONE
0 2021-05-01 00:00:00 1.00 Europe/Athens
1 2021-05-01 00:00:00 2.13 Europe/London
2 2021-05-01 00:00:00 5.13 Europe/London
3 2021-05-01 01:00:00 4.25 Europe/Dublin
4 2021-05-01 01:00:00 4.25 Europe/Paris
I am trying to assign a time zone to the DATETIME column, but using the tz_localize method, I cannot avoid using an apply call, which will be very slow on my large dataset. Is there some way to do this without using apply?
What I have now (which is slow):
df['DATETIME_WITH_TZ'] = df.apply(lambda row: row['DATETIME'].tz_localize(row['TIME_ZONE']), axis=1)
I'm not sure, but a list comprehension seems to be about 17x faster than apply in your case:
df["DATETIME_WITH_TZ"] = [dt.tz_localize(tz)
for dt,tz in zip(df["DATETIME"], df["TIME_ZONE"])]
Another variant, with tz_convert (note this treats the naive timestamps as UTC and then converts them to the target zone, so the wall-clock times differ from plain tz_localize; the output shown below corresponds to this variant):
df["DATETIME_WITH_TZ"] = [dt.tz_localize("UTC").tz_convert(tz)
for dt,tz in zip(df["DATETIME"], df["TIME_ZONE"])]
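If the data contains only a few distinct time zones, another option (not benchmarked here; a sketch assuming the same column names) is to localize one time-zone group at a time, so the work inside each group stays vectorized:
# localize each time-zone group with one vectorized call
out = pd.Series(index=df.index, dtype='object')
for tz, s in df.groupby('TIME_ZONE')['DATETIME']:
    out.loc[s.index] = s.dt.tz_localize(tz)
df['DATETIME_WITH_TZ'] = out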
Timing:
#%%timeit #listcomp1
47.4 µs ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
#%%timeit #listcomp2
25.7 µs ± 1.94 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
#%%timeit #apply
457 µs ± 16.9 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Output:
print(df)
DATETIME VALUE TIME_ZONE DATETIME_WITH_TZ
0 2021-05-01 00:00:00 1.00 Europe/Athens 2021-05-01 03:00:00+03:00
1 2021-05-01 00:00:00 2.13 Europe/London 2021-05-01 01:00:00+01:00
2 2021-05-01 00:00:00 5.13 Europe/London 2021-05-01 01:00:00+01:00
3 2021-05-01 01:00:00 4.25 Europe/Dublin 2021-05-01 02:00:00+01:00
4 2021-05-01 01:00:00 4.25 Europe/Paris 2021-05-01 03:00:00+02:00
I need to generate a list of dates in a dataframe by day, so that each day becomes a row in the new dataframe, taking into account the start date and end date of each record.
Input Dataframe:
A     B     Start                 End
A1    B1    2021-05-15 00:00:00   2021-05-17 00:00:00
A1    B2    2021-05-30 00:00:00   2021-06-02 00:00:00
A2    B3    2021-05-10 00:00:00   2021-05-12 00:00:00
A2    B4    2021-06-02 00:00:00   2021-06-04 00:00:00
Expected Output:
A     B     Start                 End
A1    B1    2021-05-15 00:00:00   2021-05-16 00:00:00
A1    B1    2021-05-16 00:00:00   2021-05-17 00:00:00
A1    B2    2021-05-30 00:00:00   2021-05-31 00:00:00
A1    B2    2021-05-31 00:00:00   2021-06-01 00:00:00
A1    B2    2021-06-01 00:00:00   2021-06-02 00:00:00
A2    B3    2021-05-10 00:00:00   2021-05-11 00:00:00
A2    B3    2021-05-11 00:00:00   2021-05-12 00:00:00
A2    B4    2021-06-02 00:00:00   2021-06-03 00:00:00
A2    B4    2021-06-03 00:00:00   2021-06-04 00:00:00
Use:
#convert columns to datetimes
df["Start"] = pd.to_datetime(df["Start"])
df["End"] = pd.to_datetime(df["End"])
#subtract values and convert to days
s = df["End"].sub(df["Start"]).dt.days
#repeat index
df = df.loc[df.index.repeat(s)].copy()
#add days by timedeltas, add 1 day for End column
add = pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d')
df['Start'] = df["Start"].add(add)
df['End'] = df["Start"] + pd.Timedelta(1, 'd')
#default index
df = df.reset_index(drop=True)
print (df)
A B Start End
0 A1 B1 2021-05-15 2021-05-16
1 A1 B1 2021-05-16 2021-05-17
2 A1 B2 2021-05-30 2021-05-31
3 A1 B2 2021-05-31 2021-06-01
4 A1 B2 2021-06-01 2021-06-02
5 A2 B3 2021-05-10 2021-05-11
6 A2 B3 2021-05-11 2021-05-12
7 A2 B4 2021-06-02 2021-06-03
8 A2 B4 2021-06-03 2021-06-04
Performance:
#4k rows
df = pd.concat([df] * 1000, ignore_index=True)
In [136]: %timeit jez(df)
16.9 ms ± 3.94 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [137]: %timeit andreas(df)
888 ms ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#800 rows
df = pd.concat([df] * 200, ignore_index=True)
In [139]: %timeit jez(df)
6.25 ms ± 46.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [140]: %timeit andreas(df)
170 ms ± 28.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
def andreas(df):
df['d_range'] = df.apply(lambda row: list(pd.date_range(start=row['Start'], end=row['End'])), axis=1)
return df.explode('d_range')
def jez(df):
df["Start"] = pd.to_datetime(df["Start"])
df["End"] = pd.to_datetime(df["End"])
#subtract values and convert to days
s = df["End"].sub(df["Start"]).dt.days
#repeat index
df = df.loc[df.index.repeat(s)].copy()
#add days by timedeltas, add 1 day for End column
add = pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d')
df['Start'] = df["Start"].add(add)
df['End'] = df["Start"] + pd.Timedelta(1, 'd')
#default index
return df.reset_index(drop=True)
You can create a list of dates and explode it:
df['d_range'] = df.apply(lambda row: list(pd.date_range(start=row['Start'], end=row['End'])), axis=1)
df = df.explode('d_range')
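To get exactly the consecutive Start/End pairs shown in the expected output, a possible follow-up on the exploded frame (a sketch reusing the d_range column from above):
df = df[df['d_range'] < df['End']]               # drop the final date of each range
df['Start'] = df['d_range']
df['End'] = df['d_range'] + pd.Timedelta(1, 'd')
df = df.drop(columns='d_range').reset_index(drop=True)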
My dataframe has multiple values per day. I want to extract the value from the last timestamp of each day.
Date_Timestamp Values
2010-01-01 11:00:00 2.5
2010-01-01 15:00:00 7.1
2010-01-01 23:59:00 11.1
2010-02-01 08:00:00 12.5
2010-02-01 17:00:00 37.1
2010-02-01 23:53:00 71.1
output:
Date_Timestamp Values
2010-01-01 23:59:00 11.1
2010-02-01 23:53:00 71.1
df['Date_Timestamp']=pd.to_datetime(df['Date_Timestamp'])
df.groupby(df['Date_Timestamp'].dt.date)['Values'].apply(lambda x: x.tail(1))
Use pandas.core.groupby.GroupBy.last
This is a vectorized method that is dramatically faster than .apply.
# given dataframe df with Date_Timestamp as a datetime
dfg = df.groupby(df.Date_Timestamp.dt.date).last().reset_index(drop=True)
# display(dfg)
Date_Timestamp Values
2010-01-01 23:59:00 11.1
2010-02-01 23:53:00 71.1
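Note that last() takes the final row of each group in the order the rows appear, so it assumes the frame is already sorted by timestamp within each day. If that is not guaranteed, one option (a sketch, assuming the same column names) is to pick the row holding the maximum timestamp explicitly:
# select the row with the latest timestamp within each calendar day
idx = df.groupby(df.Date_Timestamp.dt.date)['Date_Timestamp'].idxmax()
dfg = df.loc[idx].reset_index(drop=True)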
timeit test
import pandas as pd
import numpy as np
from datetime import datetime
# test data with 2M rows
np.random.seed(365)
rows = 2000000
df = pd.DataFrame({'datetime': pd.bdate_range(datetime(2020, 1, 1), freq='h', periods=rows).tolist(),
'values': np.random.rand(rows, )*1000})
# display(df.head())
datetime values
2020-01-01 00:00:00 941.455743
2020-01-01 01:00:00 641.602705
2020-01-01 02:00:00 684.610467
2020-01-01 03:00:00 588.562066
2020-01-01 04:00:00 543.887219
%%timeit
df.groupby(df.datetime.dt.date).last().reset_index(drop=True)
[out]:
100k: 39.8 ms ± 1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
200k: 80.7 ms ± 438 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
400k: 164 ms ± 659 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
2M: 791 ms ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# This answer, with apply, is terrible
# I let it run for 1.5 hours and it didn't finish
# I reran the test for this at 100k and 200k rows
%%timeit
df.groupby(df.datetime.dt.date)['values'].apply(lambda x: x.tail(1))
[out]:
100k: 2.42 s ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
200k: 8.77 s ± 328 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
400k: 38.2 s # I only did %%time instead of %%timeit - it takes too long
800k: 2min 54s
I have a DataFrame with one column named 'pressure' that contains repeated values, and I want to categorize it. The column looks like this:
pressure
0.03
0.03
0.03
2.07
2.07
2.07
3.01
3.01
I have tried the groupby() method but was not able to create a segment column. I think there is an easy way to do this in pandas; can anybody help me with this?
I need an output like this:
Pressure Segment
0.03 1
0.03 1
0.03 1
2.07 2
2.07 2
2.07 2
3.01 3
3.01 3
Thanks in advance
Use factorize if performance is important:
data["Segment"]= pd.factorize(data["pressure"])[0] + 1
print (data)
pressure Segment
0 0.03 1
1 0.03 1
2 0.03 1
3 2.07 2
4 2.07 2
5 2.07 2
6 3.01 3
7 3.01 3
Performance:
data = pd.DataFrame({'pressure': np.sort(np.random.randint(1000, size=10000)) / 100})
In [312]: %timeit data["pressure"].replace({j: i for i,j in enumerate(data["pressure"].unique(),start=1)}).astype("int")
141 ms ± 3.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [313]: %timeit pd.factorize(data["pressure"])[0] + 1
751 µs ± 3.97 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
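Note that factorize assigns the same code to equal values wherever they appear in the column. If the goal is instead to number consecutive runs (so a pressure value that reappears later starts a new segment), a small sketch of that variant:
# number consecutive runs of equal values rather than unique values
data["Segment"] = data["pressure"].ne(data["pressure"].shift()).cumsum()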
Create a dict mapping the unique values of the pressure column to sequential labels, then use replace:
d = {j: i for i,j in enumerate(data["Pressure"].unique(),start=1)}
data["Segment"]= data["Pressure"].replace(d).astype("int")
print(data)
Output:
Pressure Segment
0.03 1
0.03 1
0.03 1
2.07 2
2.07 2
2.07 2
3.01 3
3.01 3
I have the following pandas dataset of transactions, regarding a retail shop:
print(df)
product Date Assistant_name
product_1 2017-01-02 11:45:00 John
product_2 2017-01-02 11:45:00 John
product_3 2017-01-02 11:55:00 Mark
...
I would like to create the following dataset, for Market Basket Analysis:
product Date Assistant_name Invoice_number
product_1 2017-01-02 11:45:00 John 1
product_2 2017-01-02 11:45:00 John 1
product_3 2017-01-02 11:55:00 Mark 2
...
Briefly, I assume that transactions sharing the same Assistant_name and Date belong to the same invoice, and any new combination generates a new invoice number.
Simplest is factorize on the joined columns:
df['Invoice'] = pd.factorize(df['Date'].astype(str) + df['Assistant_name'])[0] + 1
print (df)
product Date Assistant_name Invoice
0 product_1 2017-01-02 11:45:00 John 1
1 product_2 2017-01-02 11:45:00 John 1
2 product_3 2017-01-02 11:55:00 Mark 2
If performance is important use pd.lib.fast_zip:
df['Invoice']=pd.factorize(pd.lib.fast_zip([df.Date.values, df.Assistant_name.values]))[0]+1
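Note that pd.lib.fast_zip is an internal API that is no longer available in recent pandas versions; on newer pandas, one possible equivalent (a sketch, assuming the same columns) is to number the (Date, Assistant_name) groups directly:
# number each unique (Date, Assistant_name) pair in order of appearance
df['Invoice'] = df.groupby(['Date', 'Assistant_name'], sort=False).ngroup() + 1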
Timings:
#[30000 rows x 3 columns]
df = pd.concat([df] * 10000, ignore_index=True)
In [178]: %%timeit
...: df['Invoice'] = list(zip(df['Date'], df['Assistant_name']))
...: df['Invoice'] = df['Invoice'].astype('category').cat.codes + 1
...:
9.16 ms ± 54.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [179]: %%timeit
...: df['Invoice'] = pd.factorize(df['Date'].astype(str) + df['Assistant_name'])[0] + 1
...:
11.2 ms ± 395 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [180]: %%timeit
...: df['Invoice'] = pd.factorize(pd.lib.fast_zip([df.Date.values, df.Assistant_name.values]))[0] + 1
...:
6.27 ms ± 93.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using pandas categories is one way:
df['Invoice'] = list(zip(df['Date'], df['Assistant_name']))
df['Invoice'] = df['Invoice'].astype('category').cat.codes + 1
# product Date Assistant_name Invoice
# product_1 2017-01-02 11:45:00 John 1
# product_2 2017-01-02 11:45:00 John 1
# product_3 2017-01-02 11:55:00 Mark 2
The benefit of this method is that you can easily retrieve a dictionary of the mappings:
dict(enumerate(df['Invoice'].astype('category').cat.categories, 1))
# {1: ('11:45:00', 'John'), 2: ('11:55:00', 'Mark')}