calculate time differences between consecutive rows using pandas? - python

I have a large CSV file that has a date column. I want to calculate the time differences between consecutive rows using pandas. How can I calculate the time differences in seconds and write them to a new column? I already checked similar questions, but their date format was different. These are the top five rows of my data:
2017-02-01T00:00:01
2017-02-01T00:00:01
2017-02-01T00:00:06
2017-02-01T00:00:07
2017-02-01T00:00:10
I tried
import pandas as pd
df=pd.read_csv('Output1.csv')
df['Time_diff'] = df['BaseDateTime'].diff()
print(df)
but got this error
TypeError Traceback (most recent call last)
<ipython-input-7-0dc1df27a3d2> in <module>
1 import pandas as pd
2 df=pd.read_csv('Output1.csv')
----> 3 df['Time_diff'] = df['BaseDateTime'].diff()
4 print(df)
D:\anaconda\lib\site-packages\pandas\core\series.py in diff(self, periods)
2356 dtype: float64
2357 """
-> 2358 result = algorithms.diff(self.array, periods)
2359 return self._constructor(result, index=self.index).__finalize__(self)
2360
D:\anaconda\lib\site-packages\pandas\core\algorithms.py in diff(arr, n, axis, stacklevel)
1924 out_arr[res_indexer] = arr[res_indexer] ^ arr[lag_indexer]
1925 else:
-> 1926 out_arr[res_indexer] = arr[res_indexer] - arr[lag_indexer]
1927
1928 if is_timedelta:
TypeError: unsupported operand type(s) for -: 'str' and 'str'

Try this example:
import pandas as pd
import io
s = io.StringIO('''
dates,nums
2017-02-01T00:00:01,1
2017-02-01T00:00:01,2
2017-02-01T00:00:06,3
2017-02-01T00:00:07,4
2017-02-01T00:00:10,5
''')
df = pd.read_csv(s)
Currently the frame looks like this (nums is just filler, there to provide a second column):
dates nums
0 2017-02-01T00:00:01 1
1 2017-02-01T00:00:01 2
2 2017-02-01T00:00:06 3
3 2017-02-01T00:00:07 4
4 2017-02-01T00:00:10 5
Carrying on:
# format as datetime
df['dates'] = pd.to_datetime(df['dates'])
# shift the dates up and into a new column
df['dates_shift'] = df['dates'].shift(-1)
# work out the diff
df['time_diff'] = (df['dates_shift'] - df['dates']) / pd.Timedelta(seconds=1)
# remove the temp column
del df['dates_shift']
# see what you've got
print(df)
dates nums time_diff
0 2017-02-01 00:00:01 1 0.0
1 2017-02-01 00:00:01 2 5.0
2 2017-02-01 00:00:06 3 1.0
3 2017-02-01 00:00:07 4 3.0
4 2017-02-01 00:00:10 5 NaN
To get the absolute values change this line above:
df['time_diff'] = (df['dates_shift'] - df['dates']) / pd.Timedelta(seconds=1)
To:
df['time_diff'] = (df['dates_shift'] - df['dates']).abs() / pd.Timedelta(seconds=1)
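For reference, the root cause of your TypeError is that read_csv leaves BaseDateTime as plain strings. A minimal sketch applying the same fix to your original code (assuming the column really is named BaseDateTime), using diff() so each row gets the gap to the previous row:
import pandas as pd
df = pd.read_csv('Output1.csv')
# parse the strings into datetime64 values first
df['BaseDateTime'] = pd.to_datetime(df['BaseDateTime'])
# diff() now yields Timedelta values; convert them to seconds
df['Time_diff'] = df['BaseDateTime'].diff().dt.total_seconds()
print(df)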

Related

Elegant way to add years as timedelta units to shift dates - Pandas

I have a dataframe as shown below:
df1 = pd.DataFrame({'person_id': [11, 11, 11, 21, 21],
                    'admit_dates': ['03/21/2015', '01/21/2016', '7/20/2018', '01/11/2017', '12/31/2011'],
                    'discharge_dates': ['05/09/2015', '01/29/2016', '7/27/2018', '01/12/2017', '01/31/2016'],
                    'drug_start_dates': ['05/29/1967', '01/21/1957', '7/27/1959', '01/01/1961', '12/31/1961'],
                    'offset': [223, 223, 223, 310, 310]})
What I would like to do is add the offset, which is in years, to the date columns.
So I was trying to convert the offset to a timedelta object with unit='y' or unit='Y' and then shift admit_dates:
df1['offset'] = pd.to_timedelta(df1['offset'],unit='Y') #also tried with `y` (small y)
df1['shifted_date'] = df1['admit_dates'] + df1['offset']
However, I get the below error
ValueError: Units 'M' and 'Y' are no longer supported, as they do not
represent unambiguous timedelta values durations.
Is there any other elegant way to shift dates by years?
The max Timestamp supported in pandas is Timestamp('2262-04-11 23:47:16.854775807'), so you won't be able to add 310 years to the date 12/31/2011. One possible workaround is to use Python's datetime objects, which support years up to 9999, so adding 310 years works there.
from dateutil.relativedelta import relativedelta
df1['admit_dates'] = pd.to_datetime(df1['admit_dates'])
df1['admit_dates'] = df1['admit_dates'].dt.date.add(
    df1['offset'].apply(lambda y: relativedelta(years=y)))
Result:
df1
person_id admit_dates discharge_dates drug_start_dates offset
0 11 2238-03-21 05/09/2015 05/29/1967 223
1 11 2239-01-21 01/29/2016 01/21/1957 223
2 11 2241-07-20 7/27/2018 7/27/1959 223
3 21 2327-01-11 01/12/2017 01/01/1961 310
4 21 2321-12-31 01/31/2016 12/31/1961 310
One thing you can do is extract the year out of the date, and add it to the offset:
df1 = pd.DataFrame({'person_id': [11, 11, 11, 21, 21],
                    'admit_dates': ['03/21/2015', '01/21/2016', '7/20/2018', '01/11/2017', '12/31/2011'],
                    'discharge_dates': ['05/09/2015', '01/29/2016', '7/27/2018', '01/12/2017', '01/31/2016'],
                    'drug_start_dates': ['05/29/1967', '01/21/1957', '7/27/1959', '01/01/1961', '12/31/1961'],
                    'offset': [10, 20, 2, 31, 12]})
df1.admit_dates = pd.to_datetime(df1.admit_dates)
df1["new_year"] = df1.admit_dates.dt.year + df1.offset
df1["date_with_offset"] = pd.to_datetime(pd.DataFrame({"year": df1.new_year,
"month": df1.admit_dates.dt.month,
"day":df1.admit_dates.dt.day}))
The catch: with your original offsets, some of the dates cause the following error:
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2328-01-11 00:00:00
According to the documentation, the maximum date in pandas is Apr. 11th, 2262 (at about a quarter to midnight, to be specific). That's because pandas stores timestamps as nanoseconds since the epoch in a 64-bit integer, which overflows just past that point.
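A quick way to see the bound and pre-check which rows would overflow, as a sketch (the would_overflow column is a hypothetical helper, not part of the answer above):
import pandas as pd
print(pd.Timestamp.max)   # 2262-04-11 23:47:16.854775807
# flag rows whose shifted year would exceed the pandas Timestamp range
df1['would_overflow'] = df1.admit_dates.dt.year + df1.offset > pd.Timestamp.max.year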
Units 'Y' and 'M' have been deprecated since pandas 0.25.0, but thanks to numpy's timedelta64 we can still use these units in a pandas Timedelta.
import pandas as pd
import numpy as np
# Builds your dataframe
df1 = pd.DataFrame({'person_id': [11, 11, 11, 21, 21],
                    'admit_dates': ['03/21/2015', '01/21/2016', '7/20/2018', '01/11/2017', '12/31/2011'],
                    'discharge_dates': ['05/09/2015', '01/29/2016', '7/27/2018', '01/12/2017', '01/31/2016'],
                    'drug_start_dates': ['05/29/1967', '01/21/1957', '7/27/1959', '01/01/1961', '12/31/1961'],
                    'offset': [223, 223, 223, 310, 310]})
>>> df1
person_id admit_dates discharge_dates drug_start_dates offset
0 11 03/21/2015 05/09/2015 05/29/1967 223
1 11 01/21/2016 01/29/2016 01/21/1957 223
2 11 7/20/2018 7/27/2018 7/27/1959 223
3 21 01/11/2017 01/12/2017 01/01/1961 310
4 21 12/31/2011 01/31/2016 12/31/1961 310
>>> df1['shifted_date'] = df1.apply(lambda r: pd.Timedelta(np.timedelta64(r['offset'], 'Y')) + pd.to_datetime(r['admit_dates']), axis=1)
>>> df1['shifted_date'] = df1['shifted_date'].dt.date
>>> df1
person_id admit_dates discharge_dates drug_start_dates offset shifted_date
0 11 03/21/2015 05/09/2015 05/29/1967 223 2238-03-21
1 11 01/21/2016 01/29/2016 01/21/1957 223 2239-01-21
2 11 7/20/2018 7/27/2018 7/27/1959 223 2241-07-20
....
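One caveat worth noting with this approach: numpy's 'Y' unit is the average Gregorian year rather than a calendar year, so shifted dates can drift by a few days over large offsets. A quick check:
import numpy as np
import pandas as pd
# 1 numpy 'year' is 365.2425 days, not a calendar year
print(pd.Timedelta(np.timedelta64(1, 'Y')))   # 365 days 05:49:12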

Sort Values in DataFrame using Categorical Key without groupby Split Apply Combine

So... I have a Dataframe that looks like this, but much larger:
DATE ITEM STORE STOCK
0 2018-06-06 A L001 4
1 2018-06-06 A L002 0
2 2018-06-06 A L003 4
3 2018-06-06 B L001 1
4 2018-06-06 B L002 2
You can reproduce the same DataFrame with the following code:
import pandas as pd
import numpy as np
import itertools as it
lojas = ['L001', 'L002', 'L003']
itens = list("ABC")
dr = pd.date_range(start='2018-06-06', end='2018-06-12')
df = pd.DataFrame(data=list(it.product(dr, itens, lojas)), columns=['DATE', 'ITEM', 'STORE'])
df['STOCK'] = np.random.randint(0,5, size=len(df.ITEM))
I want to calculate the STOCK difference between days for every ITEM-STORE pair. Iterating over the groups in a groupby object is easy using the .diff() function, to get something like this:
DATE ITEM STORE STOCK DELTA
0 2018-06-06 A L001 4 NaN
9 2018-06-07 A L001 0 -4.0
18 2018-06-08 A L001 4 4.0
27 2018-06-09 A L001 0 -4.0
36 2018-06-10 A L001 3 3.0
45 2018-06-11 A L001 2 -1.0
54 2018-06-12 A L001 2 0.0
I've managed to do so with the following code:
gg = df.groupby([df.ITEM, df.STORE])
lg = []
for (name, group) in gg:
    aux = group.copy()
    aux.reset_index(drop=True, inplace=True)
    aux['DELTA'] = aux.STOCK.diff().fillna(value=0)
    lg.append(aux)
df = pd.concat(lg)
But on a large DataFrame this becomes impractical. Is there a faster, more pythonic way to do this task?
Here's an improved version of your groupby code that should be a lot faster:
v = df.groupby(['ITEM', 'STORE'], sort=False).STOCK.diff()
df['DELTA'] = np.where(np.isnan(v), 0, v)
Some pointers/ideas here:
Don't iterate over groups
Don't pass series as the groupers if the series belong to the same DataFrame. Pass string labels instead.
diff can be vectorized
The last line is tantamount to a fillna, but fillna is slower than np.where
Specifying sort=False will prevent the output from being sorted by grouper keys, improving performance further
This can also be re-written as
df['DELTA'] = df.groupby(['ITEM', 'STORE'], sort=False).STOCK.diff().fillna(0)
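To see it against the question's expected view, a quick usage sketch (sorting is purely for display):
# after computing DELTA with either form above
print(df.sort_values(['ITEM', 'STORE', 'DATE']).head(7))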

Subtracting values across grouped data frames in Pandas

I have a set of IDs and Timestamps, and want to calculate the "total time elapsed per ID" by getting the difference of the oldest / earliest timestamps, grouped by ID.
Data
id timestamp
1 2018-02-01 03:00:00
1 2018-02-01 03:01:00
2 2018-02-02 10:03:00
2 2018-02-02 10:04:00
2 2018-02-02 11:05:00
Expected Result
(I want the delta converted to minutes)
id delta
1 1
2 62
I have a for loop, but it's very slow (10+ min for 1M+ rows). I was wondering if this was achievable via pandas functions?
# gb returns a DataFrameGroupBy object, grouped by ID
gb = df.groupby(['id'])
# Create the resulting df
cycletime = pd.DataFrame(columns=['id', 'timeDeltaMin'])
def calculate_delta():
    for id, groupdf in gb:
        # timestamp rows for the current id
        time = groupdf.timestamp
        time_delta = time.max() - time.min()
        # convert Timedelta object to minutes
        time_delta = time_delta / pd.Timedelta(minutes=1)
        # insert result into the cycletime df
        cycletime.loc[-1] = [id, time_delta]
        cycletime.index += 1
Thinking of trying next:
- Multiprocessing
First ensure datetimes are OK:
df.timestamp = pd.to_datetime(df.timestamp)
Now find the number of minutes in the difference between the maximum and minimum for each id:
import numpy as np
>>> (df.timestamp.groupby(df.id).max() - df.timestamp.groupby(df.id).min()) / np.timedelta64(1, 'm')
id
1 1.0
2 62.0
Name: timestamp, dtype: float64
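If you want it as a two-column frame matching the expected result, a small follow-up sketch:
out = ((df.timestamp.groupby(df.id).max() - df.timestamp.groupby(df.id).min())
       / np.timedelta64(1, 'm')).rename('delta').reset_index()
print(out)   # columns: id, delta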
You can sort by id, group by id, and then find the difference between the min and max timestamp per group.
df['timestamp'] = pd.to_datetime(df['timestamp'])
result = df.sort_values(['id']).groupby('id')['timestamp'].agg(['min', 'max'])
result['diff'] = (result['max']-result['min']) / np.timedelta64(1, 'm')
result.reset_index()[['id', 'diff']]
Output:
id diff
0 1 1.0
1 2 62.0
Another one:
import pandas as pd
import numpy as np
import datetime
ids = [1, 1, 2, 2, 2]
times = ['2018-02-01 03:00:00', '2018-02-01 03:01:00', '2018-02-02 10:03:00',
         '2018-02-02 10:04:00', '2018-02-02 11:05:00']
df = pd.DataFrame({'id': ids, 'timestamp': pd.to_datetime(pd.Series(times))})
df.set_index('id', inplace=True)
print(df.groupby(level=0).diff().sum(level=0)['timestamp'].dt.seconds/60)
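One caveat worth checking on real data: .dt.seconds is only the seconds component of each Timedelta, so any span of a day or more gets truncated; .dt.total_seconds() avoids that:
# safer when deltas may exceed 24 hours
print(df.groupby(level=0).diff().sum(level=0)['timestamp'].dt.total_seconds() / 60)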

PANDAS Time Series Window Labels

I currently have a process for windowing time series data, but I am wondering if there is a vectorized, in-place approach for performance/resource reasons.
I have two lists that have the start and end dates of 30 day windows:
start_dts = ['2014-01-01', ...]
end_dts = ['2014-01-30', ...]
I have a dataframe with a field called 'transaction_dt'.
What I am trying to accomplish is a method to add two new columns ('start_dt' and 'end_dt') to each row when the transaction_dt is between a pair of 'start_dt' and 'end_dt' values. Ideally, this would be vectorized and in-place if possible.
EDIT:
As requested here is some sample data of my format:
'customer_id','transaction_dt','product','price','units'
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25
IIUC, by using IntervalIndex:
df2.index=pd.IntervalIndex.from_arrays(df2['Start'],df2['End'],closed='both')
df[['End','Start']]=df2.loc[df['transaction_dt']].values
df
Out[457]:
transaction_dt End Start
0 2017-01-02 2017-01-31 2017-01-01
1 2017-03-02 2017-03-31 2017-03-01
2 2017-04-02 2017-04-30 2017-04-01
3 2017-05-02 2017-05-31 2017-05-01
Data input:
df=pd.DataFrame({'transaction_dt':['2017-01-02','2017-03-02','2017-04-02','2017-05-02']})
df['transaction_dt']=pd.to_datetime(df['transaction_dt'])
list1=['2017-01-01','2017-02-01','2017-03-01','2017-04-01','2017-05-01']
list2=['2017-01-31','2017-02-28','2017-03-31','2017-04-30','2017-05-31']
df2=pd.DataFrame({'Start':list1,'End':list2})
df2.Start=pd.to_datetime(df2.Start)
df2.End=pd.to_datetime(df2.End)
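The lookup works because .loc on an IntervalIndex matches each timestamp to the interval that contains it; a minimal check, assuming df and df2 as built above with the IntervalIndex already set:
# position of the interval containing each transaction_dt
print(df2.index.get_indexer(df['transaction_dt']))   # [0 2 3 4]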
If you want the start and end of the month for each row, we can use the approach from "Extracting the first day of month of a datetime type column in pandas":
import io
import pandas as pd
import datetime
string = """customer_id,transaction_dt,product,price,units
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-01-29,thing2,150,25"""
df = pd.read_csv(io.StringIO(string))
df["transaction_dt"] = pd.to_datetime(df["transaction_dt"])
df["start"] = df['transaction_dt'].dt.floor('d') - pd.offsets.MonthBegin(1)
df["end"] = df['transaction_dt'].dt.floor('d') + pd.offsets.MonthEnd(1)
df
Returns
customer_id transaction_dt product price units start end
0 1 2004-01-02 thing1 25 47 2004-01-01 2004-01-31
1 1 2004-01-17 thing2 150 8 2004-01-01 2004-01-31
2 2 2004-01-29 thing2 150 25 2004-01-01 2004-01-31
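One edge case to be aware of: subtracting MonthBegin from a date that is already on the 1st rolls back a full month, so such rows may need a guard:
import pandas as pd
ts = pd.Timestamp('2004-02-01')
print(ts - pd.offsets.MonthBegin(1))   # 2004-01-01, not 2004-02-01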
new approach:
import io
import pandas as pd
import datetime
string = """customer_id,transaction_dt,product,price,units
1,2004-01-02,thing1,25,47
1,2004-01-17,thing2,150,8
2,2004-06-29,thing2,150,25"""
df = pd.read_csv(io.StringIO(string))
df["transaction_dt"] = pd.to_datetime(df["transaction_dt"])
# Get all timestamps that are necessary.
# This assumes dates are sorted;
# if not, we should change [0] -> min_dt and [-1] -> max_dt
timestamps = [df.iloc[0]["transaction_dt"].floor('d') - pd.offsets.MonthBegin(1)]
while df.iloc[-1]["transaction_dt"].floor('d') > timestamps[-1]:
    timestamps.append(timestamps[-1] + datetime.timedelta(days=30))
# We store all ranges here
ranges = list(zip(timestamps, timestamps[1:]))
# Loop through all values and add to columns start and end
for ind, value in enumerate(df["transaction_dt"]):
    for i, (start, end) in enumerate(ranges):
        if start <= value <= end:
            df.loc[ind, "start"] = start
            df.loc[ind, "end"] = end
            # When a match is found, also remove all ranges
            # that can no longer match. This can be removed if
            # dates are not sorted, but it should speed things
            # up for large datasets.
            for _ in range(i):
                ranges.pop(0)
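For large frames, the same fixed 30-day binning can be vectorized with integer arithmetic; a sketch under the assumptions above (same month-begin anchor; note a value exactly on a boundary lands in the later window here, whereas the loop matches the earlier one):
# number of whole 30-day windows between the anchor and each transaction
anchor = df["transaction_dt"].min().floor('d') - pd.offsets.MonthBegin(1)
bin_no = (df["transaction_dt"] - anchor).dt.days // 30
df["start"] = anchor + pd.to_timedelta(bin_no * 30, unit='D')
df["end"] = df["start"] + pd.Timedelta(days=30)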

Trying to create a new dataframe column in pandas based on a dataframe related if statement

I'm learning Python & pandas and practicing with different stock calculations. I've tried to search for help with this but just haven't found a response similar enough, or else didn't understand how to deduce the correct approach from the previous responses.
I have read stock data for a given time frame with datareader into the dataframe df. In df I have Date, Volume, and Adj Close columns, which I want to use to create a new column "OBV" based on given criteria. OBV is a cumulative value that adds or subtracts today's volume to or from the previous day's OBV, depending on the adjusted close price.
The calculation of OBV is simple:
If today's Adj Close is higher than yesterday's, add today's Volume to the (cumulative) volume of yesterday.
If today's Adj Close is lower than yesterday's, subtract today's Volume from the (cumulative) volume of yesterday.
On day 1, the OBV = 0.
This is then repeated along the time frame and OBV gets accumulated.
Here are the basic imports and the start:
import numpy as np
import pandas as pd
import pandas_datareader
import datetime
from pandas_datareader import data, wb
start = datetime.date(2012, 4, 16)
end = datetime.date(2017, 4, 13)
# Reading in Yahoo Finance data with DataReader
df = data.DataReader('GOOG', 'yahoo', start, end)
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
#This is what I cannot get to work, and I've tried two different ways.
#ATTEMPT 1
def obv1(column):
    if column["Adj Close"] > column["Adj close"].shift(-1):
        val = column["Volume"].shift(-1) + column["Volume"]
    else:
        val = column["Volume"].shift(-1) - column["Volume"]
    return val
df["OBV"] = df.apply(obv1, axis=1)
#ATTEMPT 2
def obv1(df):
    if df["Adj Close"] > df["Adj close"].shift(-1):
        val = df["Volume"].shift(-1) + df["Volume"]
    else:
        val = df["Volume"].shift(-1) - df["Volume"]
    return val
df["OBV"] = df.apply(obv1, axis=1)
Both give me an error.
Consider the dataframe df
np.random.seed([3, 1415])
df = pd.DataFrame(dict(
    Volume=np.random.randint(100, 200, 10),
    AdjClose=np.random.rand(10)
))
print(df)
AdjClose Volume
0 0.951710 111
1 0.346711 198
2 0.289758 174
3 0.662151 190
4 0.171633 115
5 0.018571 155
6 0.182415 113
7 0.332961 111
8 0.150202 113
9 0.810506 126
Multiply the Volume by -1 when the change in AdjClose is non-positive (the first row, where diff() is NaN, counts as positive). Then cumsum:
(df.Volume * (~df.AdjClose.diff().le(0) * 2 - 1)).cumsum()
0 111
1 -87
2 -261
3 -71
4 -186
5 -341
6 -228
7 -117
8 -230
9 -104
dtype: int64
Include this alongside the rest of the df:
df.assign(new=(df.Volume * (~df.AdjClose.diff().le(0) * 2 - 1)).cumsum())
AdjClose Volume new
0 0.951710 111 111
1 0.346711 198 -87
2 0.289758 174 -261
3 0.662151 190 -71
4 0.171633 115 -186
5 0.018571 155 -341
6 0.182415 113 -228
7 0.332961 111 -117
8 0.150202 113 -230
9 0.810506 126 -104
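To unpack the sign trick: diff() is NaN on the first row, NaN.le(0) is False, and ~False is True, so the first row counts as +1; elsewhere the boolean is True exactly when AdjClose rose. A quick sketch on the frame above:
# True/False -> +1/-1 (unary ~ binds tighter than *)
sign = ~df.AdjClose.diff().le(0) * 2 - 1
print(sign.head())   # 1, -1, -1, 1, -1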
