Here's the setup:
I have two (integer-indexed) columns, start and month_delta. start holds timestamps (its internal type is np.datetime64[ns]) and month_delta holds integers.
I want to quickly produce the column that consists of each datetime in start, offset by the corresponding number of months in month_delta. How do I do this?
Things I've tried that don't work:
apply is too slow.
You can't add a series of DateOffset objects to a series of datetime64[ns] dtype (or a DatetimeIndex).
You can't use a Series of timedelta64 objects either; Pandas silently converts month-based timedeltas to nanosecond-based timedeltas that are ~30 days long. (Yikes! What happened to not failing silently?)
Currently I'm iterating over all different values of month_delta and doing a tshift by that amount on the relevant part of a DatetimeIndex I created, but this is a horrible kludge:
new_dates = pd.Series(pd.Timestamp.now(), index=start.index)
date_index = pd.DatetimeIndex(start)
for i in xrange(month_delta.max() + 1):
    mask = (month_delta == i)
    cur_dates = pd.Series(index=date_index[mask]).tshift(i, freq='M').index
    new_dates[mask] = cur_dates
Yuck! Any suggestions?
Here is a way to do it (by adding NumPy datetime64s with timedelta64s) without calling apply:
import pandas as pd
import numpy as np
np.random.seed(1)
def combine64(years, months=1, days=1, weeks=None, hours=None, minutes=None,
              seconds=None, milliseconds=None, microseconds=None, nanoseconds=None):
    years = np.asarray(years) - 1970
    months = np.asarray(months) - 1
    days = np.asarray(days) - 1
    types = ('<M8[Y]', '<m8[M]', '<m8[D]', '<m8[W]', '<m8[h]',
             '<m8[m]', '<m8[s]', '<m8[ms]', '<m8[us]', '<m8[ns]')
    vals = (years, months, days, weeks, hours, minutes, seconds,
            milliseconds, microseconds, nanoseconds)
    return sum(np.asarray(v, dtype=t) for t, v in zip(types, vals)
               if v is not None)

def year(dates):
    "Return an array of the years given an array of datetime64s"
    return dates.astype('M8[Y]').astype('i8') + 1970

def month(dates):
    "Return an array of the months given an array of datetime64s"
    return dates.astype('M8[M]').astype('i8') % 12 + 1

def day(dates):
    "Return an array of the days of the month given an array of datetime64s"
    return (dates - dates.astype('M8[M]')) / np.timedelta64(1, 'D') + 1

N = 10
df = pd.DataFrame({
    'start': pd.date_range('2000-1-25', periods=N, freq='D'),
    'months': np.random.randint(12, size=N)})
start = df['start'].values
df['new_date'] = combine64(year(start), months=month(start) + df['months'],
                           days=day(start))
print(df)
yields
months start new_date
0 5 2000-01-25 2000-06-25
1 11 2000-01-26 2000-12-26
2 8 2000-01-27 2000-09-27
3 9 2000-01-28 2000-10-28
4 11 2000-01-29 2000-12-29
5 5 2000-01-30 2000-06-30
6 0 2000-01-31 2000-01-31
7 0 2000-02-01 2000-02-01
8 1 2000-02-02 2000-03-02
9 7 2000-02-03 2000-09-03
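As a quick sanity check of the year/month/day helpers (a minimal sketch, assuming the definitions above are in scope):
# hypothetical spot-check of the helpers defined above
dates = np.array(['2000-01-25', '2000-12-31'], dtype='datetime64[D]')
print(year(dates))   # [2000 2000]
print(month(dates))  # [ 1 12]
print(day(dates))    # [25. 31.]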
I think something like this might work:
df['start'] = pd.to_datetime(df.start)
df.groupby('month_delta').apply(lambda x: x.start + pd.DateOffset(months=x.month_delta.iloc[0]))
There might be a better way to create a series of DateOffset objects and add them directly, but I don't know of one.
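Here is a minimal sketch of that idea (the frame and column names mirror the question and are only illustrative), collecting the per-group results back into one aligned column:
import pandas as pd

# hypothetical frame matching the question's column names
df = pd.DataFrame({
    'start': pd.to_datetime(['2000-01-25', '2000-01-26', '2000-01-27']),
    'month_delta': [5, 11, 8],
})

# one DateOffset per distinct month value, applied group-wise
shifted = df.groupby('month_delta', group_keys=False).apply(
    lambda g: g['start'] + pd.DateOffset(months=g['month_delta'].iloc[0]))
df['new_date'] = shifted  # index-aligned, so the original row order is preserved
print(df)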
I was not able to find a way without at least using an apply for setup, but assuming that is okay:
import datetime
import pandas

df = pandas.DataFrame(
    [[datetime.date(2014, 10, 22), 1], [datetime.date(2014, 11, 20), 2]],
    columns=['date', 'delta'])
>>> df
date delta
0 2014-10-22 1
1 2014-11-20 2
from dateutil.relativedelta import relativedelta
df['offset'] = df['delta'].apply(lambda x: relativedelta(months=x))
>>> df['date'] + df['offset']
0 2014-11-22
1 2015-01-20
Note that you must use the datetime from the datetime module rather than the NumPy or pandas one. Since you are only creating the deltas with the apply, I would hope you see a speedup.
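If the final column needs to be datetime64 again, the object-dtype result can be converted back (a small sketch against the frame above; new_date is just an illustrative name):
# date + relativedelta yields plain Python dates in an object column; convert explicitly
df['new_date'] = pd.to_datetime(df['date'] + df['offset'])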
Related
I have a pandas dataframe in this format:
Dates
11-Feb-18
18-Feb-18
03-Mar-18
25-Mar-18
29-Mar-18
04-Apr-18
08-Apr-18
14-Apr-18
17-Apr-18
30-Apr-18
04-May-18
I want to find the dates between two consecutive dates and put them in a new column. For example, between 11-Feb-18 and 18-Feb-18 I want all the dates that fall between those two.
I tried this code but it's throwing an error:
pd.DataFrame({'dates':pd.date_range(pd.to_datetime(df_new['Time.[Day]'].loc[i].diff(-1)))})
If you want to add a column with the list of dates that are missing in between, this should work. It could be more efficient, and it has to work around the NaT in the last row so it's a bit longer than intended, but it gives you the result.
import pandas as pd
from datetime import timedelta
test_df = pd.DataFrame({
    "Dates":
        ["11-Feb-18", "18-Feb-18", "03-Mar-18", "25-Mar-18", "29-Mar-18", "04-Apr-18",
         "08-Apr-18", "14-Apr-18", "17-Apr-18", "30-Apr-18", "04-May-18"]
})

res = (
    test_df
    .assign(
        # convert to datetime
        Dates=lambda x: pd.to_datetime(x.Dates),
        # get the next row's date
        Dates_next=lambda x: x.Dates.shift(-1),
        # create the date range between the two, skipping the trailing NaT
        Dates_list=lambda x: x.apply(
            lambda row:
                pd.date_range(row.Dates + timedelta(days=1),
                              row.Dates_next - timedelta(days=1),
                              freq="D").date.tolist()
                if pd.notnull(row.Dates_next)
                else None,
            axis=1))
)
print(res)
results in:
Dates Dates_next Dates_list
0 2018-02-11 2018-02-18 [2018-02-12, 2018-02-13, 2018-02-14, 2018-02-1...
1 2018-02-18 2018-03-03 [2018-02-19, 2018-02-20, 2018-02-21, 2018-02-2...
2 2018-03-03 2018-03-25 [2018-03-04, 2018-03-05, 2018-03-06, 2018-03-0...
3 2018-03-25 2018-03-29 [2018-03-26, 2018-03-27, 2018-03-28]
4 2018-03-29 2018-04-04 [2018-03-30, 2018-03-31, 2018-04-01, 2018-04-0...
5 2018-04-04 2018-04-08 [2018-04-05, 2018-04-06, 2018-04-07]
6 2018-04-08 2018-04-14 [2018-04-09, 2018-04-10, 2018-04-11, 2018-04-1...
7 2018-04-14 2018-04-17 [2018-04-15, 2018-04-16]
8 2018-04-17 2018-04-30 [2018-04-18, 2018-04-19, 2018-04-20, 2018-04-2...
9 2018-04-30 2018-05-04 [2018-05-01, 2018-05-02, 2018-05-03]
10 2018-05-04 NaT None
As a side note, if you don't need the last row after the analysis, you could filter it out after assigning the next date and drop the if statement to make it faster.
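A rough sketch of that variant (assuming the same test_df), dropping the row with no successor up front so the NaT guard disappears:
res2 = (
    test_df
    .assign(
        Dates=lambda x: pd.to_datetime(x.Dates),
        Dates_next=lambda x: x.Dates.shift(-1))
    .iloc[:-1]  # drop the last row, which has no next date
    .assign(
        Dates_list=lambda x: x.apply(
            lambda row: pd.date_range(row.Dates + timedelta(days=1),
                                      row.Dates_next - timedelta(days=1),
                                      freq="D").date.tolist(),
            axis=1))
)
print(res2)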
This works with dataframes, adding a new column containing the requested list.
It iterates over the dates column, preparing a list of lists for the new column.
At the end it creates a new dataframe column and assigns the prepared values to it.
import pandas as pd
from datetime import datetime, timedelta

df = pd.read_csv("test.csv")

in_betweens = []
for i in range(len(df["dates"]) - 1):
    d = datetime.strptime(df["dates"][i], "%d-%b-%y")
    d2 = datetime.strptime(df["dates"][i + 1], "%d-%b-%y")
    d = d + timedelta(days=1)
    in_between = []
    while d < d2:
        in_between.append(d.strftime("%d-%b-%y"))
        d = d + timedelta(days=1)
    in_betweens.append(in_between)
in_betweens.append([])
df["in_betweens"] = in_betweens
df.head()
I am trying to prefill a dataframe akin to the sample below.
In the sample I am randomly removing some rows to highlight the challenge. I am trying to elegantly calculate the dti value. The dti value in the first row would be 0 (even if the first row is deleted as per the script), but as gaps appear, the dti sequence needs to skip the missing rows. A logical approach would be to divide dt by delta to create a unique integer representing the bucket, but nothing I tried felt or seemed elegant.
A bit of code to help simulate the problem:
from datetime import datetime, timedelta
import pandas as pd
import numpy as np
start = datetime.now()
nin = 24
delta = '4H'
df = pd.date_range(start, periods=nin, freq=delta, name='dt')
# remove some random data points
frac_points = 8/24 # Fraction of points to retain
r = np.random.rand(nin)
df = df[r <= frac_points] # reduce the number of points
df = df.to_frame(index=False) # reindex
df['dti'] = ...
Thank you in advance,
One solution is to divide the time differences between each row by the timedelta:
from datetime import datetime, timedelta
import pandas as pd
import numpy as np
start = datetime.now()
nin = 24
delta='4H'
df = pd.date_range(start, periods=nin, freq=delta, name='dt')
# Round to nearest ten minutes for better readability
df = df.round('10min')
# Ensure reproducibility
np.random.seed(1)
# remove some random data points
frac_points = 8/24 # Fraction of points to retain
r = np.random.rand(nin)
df = df[r <= frac_points] # reduce the number of points
df = df.to_frame(index=False) # reindex
df['dti'] = df['dt'].diff() / pd.to_timedelta(delta)
df['dti'] = df['dti'].fillna(0).cumsum().astype(int)
df
dt dti
0 2019-03-17 18:10:00 0
1 2019-03-17 22:10:00 1
2 2019-03-18 02:10:00 2
3 2019-03-18 06:10:00 3
4 2019-03-18 10:10:00 4
5 2019-03-19 10:10:00 10
6 2019-03-19 18:10:00 12
7 2019-03-20 10:10:00 16
8 2019-03-20 14:10:00 17
9 2019-03-21 02:10:00 20
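An equivalent one-liner (a sketch over the same df) anchors every row on the first retained timestamp instead of accumulating differences:
df['dti'] = ((df['dt'] - df['dt'].iloc[0]) / pd.to_timedelta(delta)).round().astype(int)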
I have a set of IDs and timestamps, and want to calculate the "total time elapsed per ID" by taking the difference between the latest and earliest timestamps, grouped by ID.
Data
id timestamp
1 2018-02-01 03:00:00
1 2018-02-01 03:01:00
2 2018-02-02 10:03:00
2 2018-02-02 10:04:00
2 2018-02-02 11:05:00
Expected Result
(I want the delta converted to minutes)
id delta
1 1
2 62
I have a for loop, but it's very slow (10+ min for 1M+ rows). I was wondering if this was achievable via pandas functions?
# gb is a DataFrameGroupBy object, grouped by id
gb = df.groupby(['id'])

# Create the resulting df
cycletime = pd.DataFrame(columns=['id', 'timeDeltaMin'])

def calculate_delta():
    for id, groupdf in gb:
        # timestamp rows for the current id
        time = groupdf.timestamp
        time_delta = time.max() - time.min()
        # convert the Timedelta object to minutes
        time_delta = time_delta / pd.Timedelta(minutes=1)
        # insert the result into the cycletime df
        cycletime.loc[-1] = [id, time_delta]
        cycletime.index += 1
Thinking of trying next:
- Multiprocessing
First ensure datetimes are OK:
df.timestamp = pd.to_datetime(df.timestamp)
Now find the number of minutes in the difference between the maximum and minimum for each id:
import numpy as np
>>> (df.timestamp.groupby(df.id).max() - df.timestamp.groupby(df.id).min()) / np.timedelta64(1, 'm')
id
1 1.0
2 62.0
Name: timestamp, dtype: float64
You can sort by id and timestamp, then group by id and find the difference between the min and max timestamp per group.
import numpy as np

df['timestamp'] = pd.to_datetime(df['timestamp'])
result = df.sort_values(['id']).groupby('id')['timestamp'].agg(['min', 'max'])
result['diff'] = (result['max'] - result['min']) / np.timedelta64(1, 'm')
result.reset_index()[['id', 'diff']]
Output:
id diff
0 1 1.0
1 2 62.0
Another one:
import pandas as pd
import numpy as np
import datetime
ids = [1, 1, 2, 2, 2]
times = ['2018-02-01 03:00:00', '2018-02-01 03:01:00', '2018-02-02 10:03:00',
         '2018-02-02 10:04:00', '2018-02-02 11:05:00']
df = pd.DataFrame({'id': ids, 'timestamp': pd.to_datetime(pd.Series(times))})
df.set_index('id', inplace=True)
print(df.groupby(level=0).diff().sum(level=0)['timestamp'].dt.seconds/60)
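One caveat with this last snippet: .dt.seconds only returns the seconds component and wraps at one day, so for gaps longer than 24 hours something like total_seconds() is safer (a sketch against the same indexed df):
# total_seconds avoids the day wrap-around of .dt.seconds
gaps = df.groupby(level=0)['timestamp'].diff().groupby(level=0).sum()
print(gaps.dt.total_seconds() / 60)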
I currently have a process for windowing time series data, but I am wondering if there is a vectorized, in-place approach for performance/resource reasons.
I have two lists that have the start and end dates of 30-day windows:
start_dts = ['2014-01-01', ...]
end_dts = ['2014-01-30', ...]
I have a dataframe with a field called 'transaction_dt'.
What I am trying to accomplish is a method to add two new columns ('start_dt' and 'end_dt') to each row when the transaction_dt falls between a pair of 'start_dt' and 'end_dt' values. Ideally, this would be vectorized and in-place if possible.
EDIT:
As requested, here is some sample data in my format:
'customer_id','transaction_dt','product','price','units'
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25
IIUC, by using an IntervalIndex:
df2.index = pd.IntervalIndex.from_arrays(df2['Start'], df2['End'], closed='both')
df[['End', 'Start']] = df2.loc[df['transaction_dt']].values
df
Out[457]:
transaction_dt End Start
0 2017-01-02 2017-01-31 2017-01-01
1 2017-03-02 2017-03-31 2017-03-01
2 2017-04-02 2017-04-30 2017-04-01
3 2017-05-02 2017-05-31 2017-05-01
Data Input:
df=pd.DataFrame({'transaction_dt':['2017-01-02','2017-03-02','2017-04-02','2017-05-02']})
df['transaction_dt']=pd.to_datetime(df['transaction_dt'])
list1=['2017-01-01','2017-02-01','2017-03-01','2017-04-01','2017-05-01']
list2=['2017-01-31','2017-02-28','2017-03-31','2017-04-30','2017-05-31']
df2=pd.DataFrame({'Start':list1,'End':list2})
df2.Start=pd.to_datetime(df2.Start)
df2.End=pd.to_datetime(df2.End)
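A variant of the same lookup (a sketch over the data above) uses get_indexer, which leaves NaT where a transaction date falls outside every interval:
idx = pd.IntervalIndex.from_arrays(df2['Start'], df2['End'], closed='both')
pos = idx.get_indexer(df['transaction_dt'])  # -1 where no interval matches
lookup = df2[['Start', 'End']].reset_index(drop=True).reindex(pos)  # NaT rows for -1
df['Start'] = lookup['Start'].to_numpy()
df['End'] = lookup['End'].to_numpy()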
If you want the start and end of the month, we can use this (see: Extracting the first day of month of a datetime type column in pandas):
import io
import pandas as pd
import datetime
string = """customer_id,transaction_dt,product,price,units
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25"""
df = pd.read_csv(io.StringIO(string))
df["transaction_dt"] = pd.to_datetime(df["transaction_dt"])
df["start"] = df['transaction_dt'].dt.floor('d') - pd.offsets.MonthBegin(1)
df["end"] = df['transaction_dt'].dt.floor('d') + pd.offsets.MonthEnd(1)
df
Returns
customer_id transaction_dt product price units start end
0 1 2004-01-02 thing1 25 47 2004-01-01 2004-01-31
1 1 2004-01-17 thing2 150 8 2004-01-01 2004-01-31
2 2 2004-01-29 thing2 150 25 2004-01-01 2004-01-31
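One edge case worth noting (my observation, not part of the original answer): subtracting MonthBegin(1) from a timestamp that already sits on the first of a month rolls back a whole month. Converting through a monthly period avoids that (a sketch on the same frame):
# month start via Period is safe even when transaction_dt is already the 1st
df["start"] = df["transaction_dt"].dt.to_period("M").dt.to_timestamp()
df["end"] = df["start"] + pd.offsets.MonthEnd(1)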
New approach:
import io
import pandas as pd
import datetime
string = """customer_id,transaction_dt,product,price,units
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-06-29,thing2,150,25"""
df = pd.read_csv(io.StringIO(string))
df["transaction_dt"] = pd.to_datetime(df["transaction_dt"])
# Get all timestamps that are necessary.
# This assumes dates are sorted; if not, change [0] -> min_dt and [-1] -> max_dt.
timestamps = [df.iloc[0]["transaction_dt"].floor('d') - pd.offsets.MonthBegin(1)]
while df.iloc[-1]["transaction_dt"].floor('d') > timestamps[-1]:
    timestamps.append(timestamps[-1] + datetime.timedelta(days=30))

# We store all ranges here
ranges = list(zip(timestamps, timestamps[1:]))

# Loop through all values and add the start and end columns
for ind, value in enumerate(df["transaction_dt"]):
    for i, (start, end) in enumerate(ranges):
        if value >= start and value <= end:
            df.loc[ind, "start"] = start
            df.loc[ind, "end"] = end
            # When a match is found, also drop the ranges that were
            # already passed. This can be removed if dates are not sorted,
            # but it should speed things up for large datasets.
            for _ in range(i):
                ranges.pop(0)
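The inner loop can probably be replaced by one searchsorted lookup (a rough sketch, assuming the timestamps list built above, sorted transaction dates, and the same 30-day window width):
starts = pd.DatetimeIndex(timestamps[:-1])  # start of each 30-day window
pos = starts.searchsorted(df["transaction_dt"], side="right") - 1
df["start"] = starts[pos]
df["end"] = starts[pos] + datetime.timedelta(days=30)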
I have a pandas DataFrame with a column "StartTime" that could be any datetime value. I would like to create a second column that gives the StartTime relative to the beginning of the week (i.e., 12am on the previous Sunday). For example, this post is 5 days, 14 hours since the beginning of this week.
StartTime
1 2007-01-19 15:59:24
2 2007-03-01 04:16:08
3 2006-11-08 20:47:14
4 2008-09-06 23:57:35
5 2007-02-17 18:57:32
6 2006-12-09 12:30:49
7 2006-11-11 11:21:34
I can do this, but it's pretty dang slow:
def time_since_week_beg(x):
    y = x.to_datetime()
    return pd.Timedelta(days=y.weekday(),
                        hours=y.hour,
                        minutes=y.minute,
                        seconds=y.second)
df['dt'] = df.StartTime.apply(time_since_week_beg)
What I want is something like this, that doesn't result in an error:
df['dt'] = pd.Timedelta(days=df.StartTime.dt.dayofweek,
                        hours=df.StartTime.dt.hour,
                        minute=df.StartTime.dt.minute,
                        second=df.StartTime.dt.second)
TypeError: Invalid type <class 'pandas.core.series.Series'>. Must be int or float.
Any thoughts?
You can use a list comprehension:
df['dt'] = [pd.Timedelta(days=ts.dayofweek,
                         hours=ts.hour,
                         minutes=ts.minute,
                         seconds=ts.second)
            for ts in df.StartTime]
>>> df
StartTime dt
0 2007-01-19 15:59:24 4 days 15:59:24
1 2007-03-01 04:16:08 3 days 04:16:08
2 2006-11-08 20:47:14 2 days 20:47:14
3 2008-09-06 23:57:35 5 days 23:57:35
4 2007-02-17 18:57:32 5 days 18:57:32
5 2006-12-09 12:30:49 5 days 12:30:49
6 2006-11-11 11:21:34 5 days 11:21:34
Depending on the format of StartTime, you may need:
...for ts in pd.to_datetime(df.StartTime)
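A fully vectorized alternative (a sketch; it reproduces the Monday-anchored output above) subtracts each timestamp's week start directly:
ts = pd.to_datetime(df.StartTime)
week_start = ts.dt.normalize() - pd.to_timedelta(ts.dt.dayofweek, unit='D')
df['dt'] = ts - week_start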