Time interval between rows in a column, Python

Input
New Time
11:59:57
12:42:10
12:48:45
18:44:53
18:49:06
21:49:54
21:54:48
5:28:20
Below is the code I wrote to compute the interval in minutes.
import pandas as pd
import numpy as np
df = pd.read_csv(r"D:\test\test1.csv")
df['Interval in min'] = (pd.to_timedelta(df['New Time'].astype(str)).diff(1).dt.floor('T').dt.total_seconds().div(60))
print(df)
Output
New Time Interval in min
11:59:57 NaN
12:42:10 42.0
12:48:45 6.0
18:44:53 356.0
18:49:06 4.0
21:49:54 180.0
21:54:48 4.0
5:28:20 -987.0
The last interval, -987 min, is not correct; it should be 453 min (+1 day).

Assuming you want to consider a negative difference to be a new day, you could use:
s = pd.to_timedelta(df['New Time']).diff()
df['Interval in min'] = (s
.add(pd.to_timedelta(s.lt('0').cumsum(), unit='d'))
.dt.floor('T').dt.total_seconds().div(60)
)
Output:
New Time Interval in min
0 11:59:57 NaN
1 12:42:10 42.0
2 12:48:45 6.0
3 18:44:53 356.0
4 18:49:06 4.0
5 21:49:54 180.0
6 21:54:48 4.0
7 5:28:20 453.0
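The answer above can be put together as a self-contained sketch. The sample times are taken from the question; `pd.Timedelta(0)` is used in place of the string `'0'` for the rollover comparison, which is an equivalent but more explicit spelling:

```python
import pandas as pd

# Times copied from the question; the last one rolls over past midnight
df = pd.DataFrame({"New Time": ["11:59:57", "12:42:10", "12:48:45",
                                "18:44:53", "18:49:06", "21:49:54",
                                "21:54:48", "5:28:20"]})

s = pd.to_timedelta(df["New Time"]).diff()
# each negative diff marks a rollover to the next day; add one day per
# rollover seen so far, then floor to whole minutes
df["Interval in min"] = (
    s.add(pd.to_timedelta(s.lt(pd.Timedelta(0)).cumsum(), unit="d"))
     .dt.floor("min")
     .dt.total_seconds()
     .div(60)
)
```

The `cumsum()` trick means every row after a midnight rollover is shifted by the accumulated number of days, so multiple rollovers in one file are handled too.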

Related

Pandas Aggregate Daily Data to Monthly Timeseries

I have a time series that looks like this (below).
I want to resample it monthly, so that 2019-10 equals the average of all the October values, 2019-11 the average of all the November PTS values, and so on.
However, when I use the pd.resample('M').mean() method, if the final day of a month does not have a value, it fills in a NaN in my data frame. How do I solve this?
Date PTS
2019-10-23 14.0
2019-10-26 14.0
2019-10-27 8.0
2019-10-29 29.0
2019-10-31 17.0
2019-11-03 12.0
2019-11-05 2.0
2019-11-07 15.0
2019-11-08 7.0
2019-11-14 16.0
2019-11-16 12.0
2019-11-20 22.0
2019-11-22 9.0
2019-11-23 20.0
2019-11-25 18.0
Would this work?
pd.resample('M').mean().dropna()
Do you have a code sample? This works:
import pandas as pd
import numpy as np
rng = np.random.default_rng()
days = np.arange(31)
data = pd.DataFrame({"dates": np.datetime64("2019-03-01") + rng.choice(days, 60),
"values": rng.integers(0, 60, size=60)})
data.set_index("dates", inplace=True)
# Set the last day to null.
data.loc["2019-03-31"] = np.nan
# This works
data.resample("M").mean()
It also works with an incomplete month:
incomplete_days = np.arange(10)
data = pd.DataFrame({"dates": np.datetime64("2019-03-01") + rng.choice(incomplete_days, 10),
"values": rng.integers(0, 60, size=10)})
data.set_index("dates", inplace=True)
data.resample("M").mean()
You should check your data and types more thoroughly in case the NaN you're receiving indicates a more pressing issue.
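For completeness, the `.dropna()` suggestion from the first comment can be demonstrated with a minimal sketch. The values and dates below are made up; the gap month (no December rows at all) is what produces the NaN row that `.dropna()` then removes:

```python
import pandas as pd

# Hypothetical series with a gap: October, November, then January only
s = pd.Series(
    [14.0, 12.0, 20.0],
    index=pd.to_datetime(["2019-10-23", "2019-11-03", "2020-01-05"]),
)

monthly = s.resample("M").mean()  # December shows up as a NaN row
cleaned = monthly.dropna()        # drop the empty months
```

So a NaN from `resample('M').mean()` usually means an entire month had no observations, not that the month's last day was missing.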

How to find min and max time in Chat log conversation using pandas for calculating duration?

I want to calculate the duration for each ID and write it in a separate column.
ID Ques Time Expected output
----------------------------------
11 Hi 11.21 1min
11 Hello 11.22
13 hey 12.11 10mins
13 what 12.22
14 so 01.01 2mins
14 ok 01.03
----------------------------------
Tried so far -
First_last_cover = English_Logs['Date'].agg(['min','max'])
print ("First Conversation and Last Conversation of the month", First_last_cover)
Here it is step by step; the diff_mins column is the desired output.
import pandas as pd
df = pd.read_csv('C:/Users/FGB3140/Desktop/sample.csv')
df['Time_parsed'] = pd.to_datetime(df['Time'], format='%H.%M')
df['Time_parsed_shifted'] = df['Time_parsed'].shift(1)
df['diff'] = df['Time_parsed']-df['Time_parsed_shifted']
df['diff_mins'] = df['diff'].dt.seconds / 60
print(df)
Output
ID Ques Time Time_parsed Time_parsed_shifted diff \
0 11 Hi 11.21 1900-01-01 11:21:00 NaT NaT
1 11 Hello 11.22 1900-01-01 11:22:00 1900-01-01 11:21:00 00:01:00
2 13 hey 12.11 1900-01-01 12:11:00 1900-01-01 11:22:00 00:49:00
3 13 what 12.22 1900-01-01 12:22:00 1900-01-01 12:11:00 00:11:00
diff_mins
0 NaN
1 1.0
2 49.0
3 11.0
Explanation
Parse the Time column
Use .shift() to shift the column down a row - Time_parsed_shifted
Take the difference in time
Represent that difference in minutes
Assumption made:
The file contains a time for each ID, and Timemin and TimeMax are already calculated.
The below code shows how to calculate the difference in time and add it as a new column.
Assuming data contains the DataFrame:
import pandas as pd
import dateutil.parser
# Add a column with the duration, calculated as the time difference.
data['duration'] = data.apply(lambda x: (dateutil.parser.parse(x['TimeMax']) - dateutil.parser.parse(x['Timemin'])).total_seconds(), axis=1)
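Note that the step-by-step answer diffs across the whole frame, so the row where one ID ends and the next begins gets a meaningless difference (e.g. the 49-minute gap between IDs 11 and 13). To get one duration per ID, as the expected-output column suggests, the same data can be grouped by ID first. This is a sketch, not either answerer's code; the data below mirrors the question's table:

```python
import pandas as pd

# Data copied from the question's table
df = pd.DataFrame({
    "ID": [11, 11, 13, 13, 14, 14],
    "Ques": ["Hi", "Hello", "hey", "what", "so", "ok"],
    "Time": ["11.21", "11.22", "12.11", "12.22", "01.01", "01.03"],
})

t = pd.to_datetime(df["Time"], format="%H.%M")
# max minus min time within each ID, broadcast back to every row
df["duration_mins"] = (
    t.groupby(df["ID"]).transform(lambda x: x.max() - x.min())
     .dt.total_seconds() / 60
)
```

With this data the durations come out as 1, 11, and 2 minutes for IDs 11, 13, and 14 (the question's "10mins" for ID 13 appears to be an arithmetic slip, since 12.22 minus 12.11 is 11 minutes).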

calculate acceleration given speed

Sorry if this seems like a stupid question,
I have a dataset which looks like this
type time latitude longitude altitude (m) speed (km/h) name desc currentdistance timeelapsed
T 2017-10-07 10:44:48 28.750766667 77.088805000 783.5 0.0 2017-10-07_10-44-48 0.0 00:00:00
T 2017-10-07 10:44:58 28.752345000 77.087840000 853.5 7.8 198.70532 00:00:10
T 2017-10-07 10:45:00 28.752501667 77.087705000 854.5 7.7 220.53915 00:00:12
I'm not exactly sure how to approach this. Calculating acceleration requires taking the difference of speed and time. Any suggestions on what I may try?
Thanks in advance
Assuming your data was loaded from a CSV as follows:
type,time,latitude,longitude,altitude (m),speed (km/h),name,desc,currentdistance,timeelapsed
T,2017-10-07 10:44:48,28.750766667,77.088805000,783.5,0.0,2017-10-07_10-44-48,,0.0,00:00:00
T,2017-10-07 10:44:58,28.752345000,77.087840000,853.5,7.8,,,198.70532,00:00:10
T,2017-10-07 10:45:00,28.752501667,77.087705000,854.5,7.7,,,220.53915,00:00:12
The time column is converted to a datetime object, and the timeelapsed column is converted into seconds. From this you could add an acceleration column by
calculating the difference in speed (km/h) between each row and dividing by the difference in time between each row as follows:
from datetime import datetime
import pandas as pd
import numpy as np
df = pd.read_csv('input.csv', parse_dates=['time'], dtype={'name':str, 'desc':str})
df['timeelapsed'] = (pd.to_datetime(df['timeelapsed'], format='%H:%M:%S') - datetime(1900, 1, 1)).dt.total_seconds()
df['acceleration'] = (df['speed (km/h)'] - df['speed (km/h)'].shift(1)) / (df['timeelapsed'] - df['timeelapsed'].shift(1))
print(df)
Giving you:
type time latitude longitude altitude (m) speed (km/h) name desc currentdistance timeelapsed acceleration
0 T 2017-10-07 10:44:48 28.750767 77.088805 783.5 0.0 2017-10-07_10-44-48 NaN 0.00000 0.0 NaN
1 T 2017-10-07 10:44:58 28.752345 77.087840 853.5 7.8 NaN NaN 198.70532 10.0 0.78
2 T 2017-10-07 10:45:00 28.752502 77.087705 854.5 7.7 NaN NaN 220.53915 12.0 -0.05
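The manual `shift(1)` subtraction in the answer can also be written with `Series.diff()`, which subtracts the previous row directly. This is a sketch using just the two relevant columns, with the speed and elapsed-seconds values from the answer's output:

```python
import pandas as pd

# Speed and elapsed time (already in seconds) from the rows above
df = pd.DataFrame({
    "speed (km/h)": [0.0, 7.8, 7.7],
    "timeelapsed": [0.0, 10.0, 12.0],
})

# acceleration = change in speed / change in time, row over row
df["acceleration"] = df["speed (km/h)"].diff() / df["timeelapsed"].diff()
```

Note the units here are (km/h) per second; multiply by 1000/3600 if you want m/s².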

How to calculate total occupancy days for each day of year, given a dataframe of start and end dates?

I have a csv file and hence list or dataframe that contains start and end dates of visits to a campsite.
start_date end_date
0 2016-01-21 2016-01-24
1 2016-01-28 2016-01-29
2 2016-02-02 2016-02-10
3 2016-02-08 2016-02-12
...
I would like to calculate a dataframe with a row for each day in the time period, with a column calculating cumulative visitors, a column denoting number of visitors resident on that day and a cumulative sum of visitor days.
I currently have some hacky code that reads the visitor data into an ordinary Python list visitor_array and creates another list year_array with an entry for each date in the period/year. It then loops over each date in year_array, with an inner loop over visitor_array, and appends to the current element of year_array a count of new visitors and the number of resident visitors on that day.
temp_day = datetime.date(2016, 1, 1)
year_array = [[temp_day + datetime.timedelta(days=d)] for d in range(365)]
for day in year_array:
    new_visitors = 0
    occupancy = 0
    for visitor in visitor_array:
        if visitor[0] == day[0]:
            new_visitors += 1
        if visitor[0] <= day[0] <= visitor[1]:
            occupancy += 1
    day.append(new_visitors)
    day.append(occupancy)
I then convert year_array into a pandas dataframe, create some cumsum columns and get busy plotting etc etc
Is there a more elegant pythonic/pandasic way of doing this all within pandas?
Considering df the dataframe with start/end values and d the final dataframe, I would have made something like this:
Code:
import numpy as np
import pandas as pd
import datetime
# ---- Create df sample
df = pd.DataFrame([['21/01/2016','24/01/2016'],
['28/01/2016','29/01/2016'],
['02/02/2016','10/02/2016'],
['08/02/2016','12/02/2016']], columns=['start','end'] )
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
# ---- Create day index
temp_day = datetime.date(2016,1,1)
index = [(temp_day + datetime.timedelta(days=d)) for d in range(365)]
# ---- Create empty result df
# initialize df, set days as datetime in index
d = pd.DataFrame(np.zeros((365,3)),
index=pd.to_datetime(index),
columns=['new_visitor','occupancy','occupied_day'])
# ---- Iterate over df to fill d (final df)
for i, row in df.iterrows():
    # Add 1 on the first day for each new visitor
    d.loc[row.start, 'new_visitor'] += 1
    # 1 if some visitor is present between df.start and df.end
    d.loc[row.start:row.end, 'occupied_day'] = 1
    # Add 1 to the occupancy for each of these days
    d.loc[row.start:row.end, 'occupancy'] += 1
# cumulated days = sum of occupied days
d['cumul_days'] = d.occupied_day.cumsum()
# cumulated visitors = sum of occupancy
d['cumul_visitors'] = d.occupancy.cumsum()
An extract of the resulting output, print(d.loc['2016-01-21':'2016-01-29']):
index new_visitor occupancy occupied_day cumul_days cumul_visitors
2016-01-21 1.0 1.0 1.0 1.0 1.0
2016-01-22 0.0 1.0 0.0 1.0 2.0
2016-01-23 0.0 1.0 0.0 1.0 3.0
2016-01-24 0.0 1.0 0.0 1.0 4.0
2016-01-25 0.0 0.0 0.0 1.0 4.0
2016-01-26 0.0 0.0 0.0 1.0 4.0
2016-01-27 0.0 0.0 0.0 1.0 4.0
2016-01-28 1.0 1.0 1.0 2.0 5.0
2016-01-29 0.0 1.0 0.0 2.0 6.0
Hope this code helps!
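If the row-by-row `iterrows()` loop becomes slow on larger files, a fully vectorized alternative is to expand each stay into its individual days with `pd.date_range` and count per day. This is a sketch, not the answerer's code; the third stay below is made up to show overlapping occupancy:

```python
import pandas as pd

# Two stays from the question plus a hypothetical overlapping one
df = pd.DataFrame({
    "start": pd.to_datetime(["2016-01-21", "2016-01-23", "2016-01-28"]),
    "end":   pd.to_datetime(["2016-01-24", "2016-01-28", "2016-01-29"]),
})

# one row per occupied day per visitor (end date inclusive)
days = pd.concat(
    [pd.Series(pd.date_range(s, e)) for s, e in zip(df["start"], df["end"])]
)
# visitors resident on each day
occupancy = days.value_counts().sort_index()
# arrivals per day
new_visitors = df["start"].value_counts().sort_index()
```

From `occupancy` the cumulative columns follow with `cumsum()`, just as in the loop version.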

How to fillna/missing values for an irregular timeseries for a Drug when Half-life is known

I have a dataframe (df) where column A is the drug units dosed at the time point given by Timestamp. I want to fill the missing values (NaN) with the drug concentration given the half-life of the drug (180 mins). I am struggling with the code in pandas. Would really appreciate help and insight. Thanks in advance.
df
A
Timestamp
1991-04-21 09:09:00 9.0
1991-04-21 3:00:00 NaN
1991-04-21 9:00:00 NaN
1991-04-22 07:35:00 10.0
1991-04-22 13:40:00 NaN
1991-04-22 16:56:00 NaN
Given that the half-life of the drug is 180 mins, I wanted to fillna(values) as a function of the time elapsed and the half-life of the drug,
something like
Timestamp A
1991-04-21 09:00:00 9.0
1991-04-21 3:00:00 ~2.25
1991-04-21 9:00:00 ~0.55
1991-04-22 07:35:00 10.0
1991-04-22 13:40:00 ~2.5
1991-04-22 16:56:00 ~0.75
Your timestamps are not sorted and I'm assuming this was a typo. I fixed it below.
import pandas as pd
import numpy as np
from io import StringIO
text = """TimeStamp A
1991-04-21 09:09:00 9.0
1991-04-21 13:00:00 NaN
1991-04-21 19:00:00 NaN
1991-04-22 07:35:00 10.0
1991-04-22 13:40:00 NaN
1991-04-22 16:56:00 NaN """
df = pd.read_csv(StringIO(text), sep='\s{2,}', engine='python', parse_dates=[0])
This is the magic code.
# half-life of 180 minutes is 10,800 seconds
# we need to calculate lamda (intentionally mis-spelled)
lamda = 10800 / np.log(2)
# returns time difference for each element
# relative to first element
def time_diff(x):
return x - x.iloc[0]
# create partition of non-nulls with subsequent nulls
partition = df.A.notnull().cumsum()
# calculate time differences in seconds for each
# element relative to most recent non-null observation
# use .dt accessor and method .total_seconds()
tdiffs = df.TimeStamp.groupby(partition).apply(time_diff).dt.total_seconds()
# apply exponential decay
decay = np.exp(-tdiffs / lamda)
# finally, forward fill the observations and multiply by decay
decay * df.A.ffill()
0 9.000000
1 3.697606
2 0.924402
3 10.000000
4 2.452325
5 1.152895
dtype: float64
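As a sanity check on the decay constant (this check is mine, not part of the original answer): with lamda = half_life / ln(2), the factor exp(-t / lamda) equals exactly one half at t = half_life, so a 9.0 dose decays to 4.5 after 180 minutes, and the 13:00 row, which is 3 h 51 min (13,860 s) after the 09:09 dose, reproduces the 3.697606 shown above:

```python
import numpy as np

half_life = 10800                 # 180 minutes in seconds
lamda = half_life / np.log(2)     # decay constant

# concentration after exactly one half-life: should be half the dose
after_half_life = 9.0 * np.exp(-half_life / lamda)

# concentration 13,860 s after the 9.0 dose (the 13:00 row)
after_13860 = 9.0 * np.exp(-13860 / lamda)
```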
