Elegant pandas pre-fill using date_range with various possible freq settings - python

I am trying to prefill a dataframe like the one built below. In the sample I randomly remove some rows to highlight the challenge. I am trying to elegantly calculate the dti value: in the first row it would be 0 (even if that row is deleted by the script), but as gaps appear, the dti sequence needs to skip the missing rows. A logical approach would be to divide dt by delta to create a unique integer representing the bucket, but nothing I tried felt or seemed elegant.
A bit of code to help simulate the problem:
from datetime import datetime, timedelta
import pandas as pd
import numpy as np
start = datetime.now()
nin = 24
delta='4H'
df = pd.date_range(start, periods=nin, freq=delta, name='dt')
# remove some random data points
frac_points = 8/24 # Fraction of points to retain
r = np.random.rand(nin)
df = df[r <= frac_points] # reduce the number of points
df = df.to_frame(index=False) # reindex
df['dti'] = ...
Thank you in advance,

One solution is to divide the time difference between consecutive rows by the timedelta and accumulate the result:
from datetime import datetime, timedelta
import pandas as pd
import numpy as np
start = datetime.now()
nin = 24
delta='4H'
df = pd.date_range(start, periods=nin, freq=delta, name='dt')
# Round to nearest ten minutes for better readability
df = df.round('10min')
# Ensure reproducibility
np.random.seed(1)
# remove some random data points
frac_points = 8/24 # Fraction of points to retain
r = np.random.rand(nin)
df = df[r <= frac_points] # reduce the number of points
df = df.to_frame(index=False) # reindex
df['dti'] = df['dt'].diff() / pd.to_timedelta(delta)
df['dti'] = df['dti'].fillna(0).cumsum().astype(int)
df
dt dti
0 2019-03-17 18:10:00 0
1 2019-03-17 22:10:00 1
2 2019-03-18 02:10:00 2
3 2019-03-18 06:10:00 3
4 2019-03-18 10:10:00 4
5 2019-03-19 10:10:00 10
6 2019-03-19 18:10:00 12
7 2019-03-20 10:10:00 16
8 2019-03-20 14:10:00 17
9 2019-03-21 02:10:00 20
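An equivalent one-liner measures every row against the first retained timestamp rather than accumulating diffs; it gives the same integers as long as all gaps are exact multiples of delta:
df['dti'] = ((df['dt'] - df['dt'].iloc[0]) / pd.to_timedelta(delta)).astype(int)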

Related

Pandas timeseries replace weekend with one value generated from the weekend mean

I have a multicolumn pandas dataframe, with rows for each day.
Now I would like to replace each weekend with its mean value in one row.
I.e. (Fr,Sa,Su).resample().mean() --> (Weekend)
Not sure where to start even.
Thank you in advance.
import pandas as pd
from datetime import timedelta
# make some data
df = pd.DataFrame({'dt': pd.date_range("2018-11-27", "2018-12-12"), "val": range(0,16)})
# adjust the weekend dates to fall on the friday
df['shifted'] = [d - timedelta(days = max(d.weekday() - 4, 0)) for d in df['dt']]
# calc the mean
df2 = df.groupby(df['shifted']).val.mean()
df2
#Out[105]:
#shifted
#2018-11-27 0
#2018-11-28 1
#2018-11-29 2
#2018-11-30 4
#2018-12-03 6
#2018-12-04 7
#2018-12-05 8
#2018-12-06 9
#2018-12-07 11
#2018-12-10 13
#2018-12-11 14
#2018-12-12 15
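The list comprehension works, but the same Friday-shift can be done in vectorized form with the .dt accessor; a sketch equivalent to the loop above:
# Mon-Fri shift by 0 days; Sat and Sun shift back 1 and 2 days onto Friday
shift_days = (df['dt'].dt.weekday - 4).clip(lower=0)
df['shifted'] = df['dt'] - pd.to_timedelta(shift_days, unit='D')
df2 = df.groupby('shifted').val.mean()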

Subtracting values across grouped data frames in Pandas

I have a set of IDs and timestamps, and want to calculate the "total time elapsed per ID" by taking the difference between the latest and earliest timestamp, grouped by ID.
Data
id timestamp
1 2018-02-01 03:00:00
1 2018-02-01 03:01:00
2 2018-02-02 10:03:00
2 2018-02-02 10:04:00
2 2018-02-02 11:05:00
Expected Result
(I want the delta converted to minutes)
id delta
1 1
2 62
I have a for loop, but it's very slow (10+ min for 1M+ rows). I was wondering if this was achievable via pandas functions?
# gb is a DataFrameGroupBy object, grouped by ID
gb = df.groupby(['id'])
# Create the resulting df
cycletime = pd.DataFrame(columns=['id', 'timeDeltaMin'])
def calculate_delta():
    for id, groupdf in gb:
        # timestamp rows for the current id
        time = groupdf.timestamp
        time_delta = time.max() - time.min()
        # convert the Timedelta object to minutes
        time_delta = time_delta / pd.Timedelta(minutes=1)
        # insert the result into the cycletime df
        cycletime.loc[-1] = [id, time_delta]
        cycletime.index += 1
Thinking of trying next:
- Multiprocessing
First ensure datetimes are OK:
df.timestamp = pd.to_datetime(df.timestamp)
Now find the number of minutes in the difference between the maximum and minimum for each id:
import numpy as np
>>> (df.timestamp.groupby(df.id).max() - df.timestamp.groupby(df.id).min()) / np.timedelta64(1, 'm')
id
1 1.0
2 62.0
Name: timestamp, dtype: float64
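The two groupby passes can also be collapsed into a single agg call; a minor variation, not a different algorithm:
df.groupby('id')['timestamp'].agg(lambda s: (s.max() - s.min()) / pd.Timedelta(minutes=1))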
You can sort by id and timestamp, then groupby id and then find the difference between min and max timestamp per group.
df['timestamp'] = pd.to_datetime(df['timestamp'])
result = df.sort_values(['id']).groupby('id')['timestamp'].agg(['min', 'max'])
result['diff'] = (result['max']-result['min']) / np.timedelta64(1, 'm')
result.reset_index()[['id', 'diff']]
Output:
id diff
0 1 1.0
1 2 62.0
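On pandas 0.25+, named aggregation expresses the same min/max step a bit more directly; a purely stylistic variant of the code above:
result = df.groupby('id')['timestamp'].agg(earliest='min', latest='max')
result['diff'] = (result['latest'] - result['earliest']) / np.timedelta64(1, 'm')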
Another one:
import pandas as pd
import numpy as np
import datetime
ids = [1,1,2,2,2]
times = ['2018-02-01 03:00:00', '2018-02-01 03:01:00', '2018-02-02 10:03:00',
         '2018-02-02 10:04:00', '2018-02-02 11:05:00']
df = pd.DataFrame({'id':ids,'timestamp':pd.to_datetime(pd.Series(times))})
df.set_index('id', inplace=True)
print(df.groupby(level=0).diff().sum(level=0)['timestamp'].dt.seconds/60)
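Two caveats about that last line: .dt.seconds returns only the seconds component of each timedelta, so it silently drops whole days once a group spans more than 24 hours, and sum(level=0) is deprecated in recent pandas in favour of an explicit groupby. A safer sketch of the same computation:
print(df.groupby(level=0).diff().groupby(level=0).sum()['timestamp'].dt.total_seconds() / 60)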

PANDAS Time Series Window Labels

I currently have a process for windowing time series data, but I am wondering if there is a vectorized, in-place approach for performance/resource reasons.
I have two lists that have the start and end dates of 30 day windows:
start_dts = [2014-01-01,...]
end_dts = [2014-01-30,...]
I have a dataframe with a field called 'transaction_dt'.
What I am trying to accomplish is a method to add two new columns ('start_dt' and 'end_dt') to each row whenever the transaction_dt falls between a pair of start and end values. Ideally, this would be vectorized and in-place if possible.
EDIT:
As requested here is some sample data of my format:
'customer_id','transaction_dt','product','price','units'
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25
IIUC, by using IntervalIndex:
df2.index = pd.IntervalIndex.from_arrays(df2['Start'], df2['End'], closed='both')
df[['Start', 'End']] = df2.loc[df['transaction_dt']].values
df
Out[457]:
  transaction_dt      Start        End
0     2017-01-02 2017-01-01 2017-01-31
1     2017-03-02 2017-03-01 2017-03-31
2     2017-04-02 2017-04-01 2017-04-30
3     2017-05-02 2017-05-01 2017-05-31
Data input:
df=pd.DataFrame({'transaction_dt':['2017-01-02','2017-03-02','2017-04-02','2017-05-02']})
df['transaction_dt']=pd.to_datetime(df['transaction_dt'])
list1=['2017-01-01','2017-02-01','2017-03-01','2017-04-01','2017-05-01']
list2=['2017-01-31','2017-02-28','2017-03-31','2017-04-30','2017-05-31']
df2=pd.DataFrame({'Start':list1,'End':list2})
df2.Start=pd.to_datetime(df2.Start)
df2.End=pd.to_datetime(df2.End)
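One caveat with the .loc lookup above: a transaction date that falls outside every interval raises a KeyError. pd.cut accepts the same IntervalIndex as bins and yields NaN for unmatched dates instead; a hedged sketch:
idx = pd.IntervalIndex.from_arrays(df2['Start'], df2['End'], closed='both')
cats = pd.cut(df['transaction_dt'], idx)
df['Start'] = cats.map(lambda iv: iv.left, na_action='ignore')
df['End'] = cats.map(lambda iv: iv.right, na_action='ignore')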
If you want the start and end of each month, we can use the approach from "Extracting the first day of month of a datetime type column in pandas":
import io
import pandas as pd
import datetime
string = """customer_id,transaction_dt,product,price,units
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25"""
df = pd.read_csv(io.StringIO(string))
df["transaction_dt"] = pd.to_datetime(df["transaction_dt"])
df["start"] = df['transaction_dt'].dt.floor('d') - pd.offsets.MonthBegin(1)
df["end"] = df['transaction_dt'].dt.floor('d') + pd.offsets.MonthEnd(1)
df
Returns
customer_id transaction_dt product price units start end
0 1 2004-01-02 thing1 25 47 2004-01-01 2004-01-31
1 1 2004-01-17 thing2 150 8 2004-01-01 2004-01-31
2 2 2004-01-29 thing2 150 25 2004-01-01 2004-01-31
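One edge case worth checking (an assumption about anchored offsets, verify on your data): MonthBegin rolls, so a transaction that already falls on the first of the month gets a start in the previous month. Period arithmetic sidesteps that; a sketch:
# Pin start/end to the transaction's own calendar month
months = df['transaction_dt'].dt.to_period('M')
df['start'] = months.dt.start_time
df['end'] = months.dt.end_time.dt.floor('d')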
new approach:
import io
import pandas as pd
import datetime
string = """customer_id,transaction_dt,product,price,units
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-06-29,thing2,150,25"""
df = pd.read_csv(io.StringIO(string))
df["transaction_dt"] = pd.to_datetime(df["transaction_dt"])
# Get all timestamps that are necessary.
# This assumes dates are sorted;
# if not, we should change [0] -> min_dt and [-1] -> max_dt
timestamps = [df.iloc[0]["transaction_dt"].floor('d') - pd.offsets.MonthBegin(1)]
while df.iloc[-1]["transaction_dt"].floor('d') > timestamps[-1]:
    timestamps.append(timestamps[-1] + datetime.timedelta(days=30))
# We store all ranges here
ranges = list(zip(timestamps, timestamps[1:]))
# Loop through all values and fill the start and end columns
for ind, value in enumerate(df["transaction_dt"]):
    for i, (start, end) in enumerate(ranges):
        if start <= value <= end:
            df.loc[ind, "start"] = start
            df.loc[ind, "end"] = end
            # When a match is found, also drop the ranges that
            # can no longer match. This relies on the dates being
            # sorted (remove it otherwise), but it speeds things
            # up for large datasets.
            for _ in range(i):
                ranges.pop(0)
            break  # stop scanning once the window is found
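For large frames the double loop could likely be replaced by a vectorized lookup against the same 30-day boundaries; a sketch with np.searchsorted, assuming every transaction falls within the span of the timestamps list built above:
import numpy as np
bounds = pd.DatetimeIndex(timestamps)
# Index of the window each transaction falls into (boundary dates land in the earlier window)
pos = (np.searchsorted(bounds, df['transaction_dt'], side='left') - 1).clip(min=0)
df['start'] = bounds[pos]
df['end'] = bounds[pos + 1]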

interpolate/extrapolate missing dates in python?

Let's say I have the following dataframe:
bb = pd.DataFrame(data = {'date' :['','','','2015-09-02', '2015-09-02', '2015-09-03','','2015-09-08', '', '2015-09-11','2015-09-14','','' ]})
bb['date'] = pd.to_datetime(bb['date'], format="%Y-%m-%d")
I want to interpolate and extrapolate linearly to fill in the missing date values. I used the following code, but it doesn't change anything. I am new to pandas, please help.
bb= bb.interpolate(method='time')
To extrapolate you have to use bfill() and ffill(); missing values are filled with the next (or previous) valid value.
To interpolate linearly you have to use interpolate, but the dates need to be converted to numbers first; interpolate works on numeric values, so your call leaves the datetime column untouched:
import numpy as np
import pandas as pd
from datetime import datetime
bb = pd.DataFrame(data = {'date' :['','','','2015-09-02', '2015-09-02', '2015-09-03','','2015-09-08', '', '2015-09-11','2015-09-14','','' ]})
bb['date'] = pd.to_datetime(bb['date'], format="%Y-%m-%d")
# convert to seconds
tmp = bb['date'].apply(lambda t: (t-datetime(1970,1,1)).total_seconds())
# linear interpolation
tmp.interpolate(inplace=True)
# back convert to dates
bb['date'] = pd.to_datetime(tmp, unit='s')
bb['date'] = bb['date'].apply(lambda t: t.date())
# extrapolation for the first missing values
bb.bfill(inplace=True)
print(bb)
Result:
date
0 2015-09-02
1 2015-09-02
2 2015-09-02
3 2015-09-02
4 2015-09-02
5 2015-09-03
6 2015-09-05
7 2015-09-08
8 2015-09-09
9 2015-09-11
10 2015-09-14
11 2015-09-14
12 2015-09-14

Pandas: Quickly add variable number of months to a timestamp column

Here's the setup:
I have two (integer-indexed) columns, start and month_delta. start holds timestamps (its internal type is np.datetime64[ns]) and month_delta holds integers.
I want to quickly produce a column consisting of each datetime in start, offset by the corresponding number of months in month_delta. How do I do this?
Things I've tried that don't work:
apply is too slow.
You can't add a series of DateOffset objects to a series of datetime64[ns] dtype (or a DatetimeIndex).
You can't use a Series of timedelta64 objects either; Pandas silently converts month-based timedeltas to nanosecond-based timedeltas that are ~30 days long. (Yikes! What happened to not failing silently?)
Currently I'm iterating over all different values of month_delta and doing a tshift by that amount on the relevant part of a DatetimeIndex I created, but this is a horrible kludge:
new_dates = pd.Series(pd.Timestamp.now(), index=start.index)
date_index = pd.DatetimeIndex(start)
for i in range(month_delta.max() + 1):
    mask = (month_delta == i)
    cur_dates = pd.Series(index=date_index[mask]).tshift(i, freq='M').index
    new_dates[mask] = cur_dates
Yuck! Any suggestions?
Here is a way to do it (by adding NumPy datetime64s with timedelta64s) without calling apply:
import pandas as pd
import numpy as np
np.random.seed(1)
def combine64(years, months=1, days=1, weeks=None, hours=None, minutes=None,
              seconds=None, milliseconds=None, microseconds=None, nanoseconds=None):
    years = np.asarray(years) - 1970
    months = np.asarray(months) - 1
    days = np.asarray(days) - 1
    types = ('<M8[Y]', '<m8[M]', '<m8[D]', '<m8[W]', '<m8[h]',
             '<m8[m]', '<m8[s]', '<m8[ms]', '<m8[us]', '<m8[ns]')
    vals = (years, months, days, weeks, hours, minutes, seconds,
            milliseconds, microseconds, nanoseconds)
    return sum(np.asarray(v, dtype=t) for t, v in zip(types, vals)
               if v is not None)

def year(dates):
    "Return an array of the years given an array of datetime64s"
    return dates.astype('M8[Y]').astype('i8') + 1970

def month(dates):
    "Return an array of the months given an array of datetime64s"
    return dates.astype('M8[M]').astype('i8') % 12 + 1

def day(dates):
    "Return an array of the days of the month given an array of datetime64s"
    return (dates - dates.astype('M8[M]')) / np.timedelta64(1, 'D') + 1
N = 10
df = pd.DataFrame({
    'start': pd.date_range('2000-1-25', periods=N, freq='D'),
    'months': np.random.randint(12, size=N)})
start = df['start'].values
df['new_date'] = combine64(year(start), months=month(start) + df['months'],
                           days=day(start))
print(df)
yields
months start new_date
0 5 2000-01-25 2000-06-25
1 11 2000-01-26 2000-12-26
2 8 2000-01-27 2000-09-27
3 9 2000-01-28 2000-10-28
4 11 2000-01-29 2000-12-29
5 5 2000-01-30 2000-06-30
6 0 2000-01-31 2000-01-31
7 0 2000-02-01 2000-02-01
8 1 2000-02-02 2000-03-02
9 7 2000-02-03 2000-09-03
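One behaviour to be aware of with this composition (my reading of the datetime64/timedelta64 addition above, worth verifying): when the source day does not exist in the target month, the days component rolls over into the following month instead of clipping. For example:
# 2000-01-31 plus one month: datetime64 '2000-02' + 30 days -> 2000-03-02
print(combine64(2000, months=1 + 1, days=31))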
I think something like this might work:
df['start'] = pd.to_datetime(df.start)
df.groupby('month_delta').apply(lambda x: x.start + pd.DateOffset(months=x.month_delta.iloc[0]))
There might be a better way to create a series of DateOffset objects and add them directly, but I don't know of one...
I was not able to find a way without at least using an apply for the setup, but assuming that is okay:
import datetime
import pandas
df = pandas.DataFrame(
    [[datetime.date(2014, 10, 22), 1], [datetime.date(2014, 11, 20), 2]],
    columns=['date', 'delta'])
>>> df
date delta
0 2014-10-22 1
1 2014-11-20 2
from dateutil.relativedelta import relativedelta
df['offset'] = df['delta'].apply(lambda x: relativedelta(months=x))
>>> df['date'] + df['offset']
0 2014-11-22
1 2015-01-20
Note that you must use the datetime from the datetime module rather than the NumPy or pandas one. Since you are only creating the deltas with the apply, I would hope you experience a speedup.
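For larger frames, a vectorized alternative via period arithmetic avoids the per-row relativedelta objects; a sketch (note that it rolls day overflow into the next month, whereas relativedelta clips to the month's last day):
import pandas as pd
df = pd.DataFrame({'date': pd.to_datetime(['2014-10-22', '2014-11-20']),
                   'delta': [1, 2]})
# Shift the month with period arithmetic, then re-apply the day of month
shifted = (df['date'].dt.to_period('M') + df['delta']).dt.to_timestamp()
df['new_date'] = shifted + pd.to_timedelta(df['date'].dt.day - 1, unit='D')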
