Input
New Time
11:59:57
12:42:10
12:48:45
18:44:53
18:49:06
21:49:54
21:54:48
5:28:20
Below is the code I wrote to create the interval in minutes.
import pandas as pd
import numpy as np
df = pd.read_csv(r"D:\test\test1.csv")
df['Interval in min'] = (pd.to_timedelta(df['New Time'].astype(str)).diff(1).dt.floor('T').dt.total_seconds().div(60))
print(df)
Output
New Time Interval in min
11:59:57 NaN
12:42:10 42.0
12:48:45 6.0
18:44:53 356.0
18:49:06 4.0
21:49:54 180.0
21:54:48 4.0
5:28:20 -987.0
The last interval, i.e. -987 min, is not correct; it should instead be 453 min (+1 day).
Assuming you want to consider a negative difference to be a new day, you could use:
s = pd.to_timedelta(df['New Time']).diff()
df['Interval in min'] = (s
    # each negative diff marks a day rollover; cumsum counts the rollovers
    .add(pd.to_timedelta(s.lt('0').cumsum(), unit='d'))
    # floor to whole minutes, then express the result in minutes
    .dt.floor('T').dt.total_seconds().div(60)
)
Output:
New Time Interval in min
0 11:59:57 NaN
1 12:42:10 42.0
2 12:48:45 6.0
3 18:44:53 356.0
4 18:49:06 4.0
5 21:49:54 180.0
6 21:54:48 4.0
7 5:28:20 453.0
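If it helps to see why this works: each negative diff flags a midnight rollover, and the cumulative sum turns those flags into a running day offset. A minimal sketch of the intermediate steps (s is the raw diff from the answer above):
# each negative diff flags a rollover; cumsum gives a running day count
rollovers = (s < pd.Timedelta(0)).cumsum()   # 0 for rows 0-6, 1 for row 7 here
offset = pd.to_timedelta(rollovers, unit='d')  # 0 days, ..., 1 day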
I have a time series that looks like this (below), and I want to resample it monthly, so that 2019-10 is equal to the average of all the PTS values of October, November is the average of all the PTS values for November, etc.
However, when I use the df.resample('M').mean() method, if the final day of a month does not have a value, it fills in a NaN in my data frame. How do I solve this?
Date PTS
2019-10-23 14.0
2019-10-26 14.0
2019-10-27 8.0
2019-10-29 29.0
2019-10-31 17.0
2019-11-03 12.0
2019-11-05 2.0
2019-11-07 15.0
2019-11-08 7.0
2019-11-14 16.0
2019-11-16 12.0
2019-11-20 22.0
2019-11-22 9.0
2019-11-23 20.0
2019-11-25 18.0
Would this work?
df.resample('M').mean().dropna()
Do you have a code sample? This works:
import pandas as pd
import numpy as np
rng = np.random.default_rng()
days = np.arange(31)
data = pd.DataFrame({"dates": np.datetime64("2019-03-01") + rng.choice(days, 60),
"values": rng.integers(0, 60, size=60)})
data.set_index("dates", inplace=True)
# Set the last day to null.
data.loc["2019-03-31"] = np.nan
# This works
data.resample("M").mean()
It also works with an incomplete month:
incomplete_days = np.arange(10)
data = pd.DataFrame({"dates": np.datetime64("2019-03-01") + rng.choice(incomplete_days, 10),
"values": rng.integers(0, 60, size=10)})
data.set_index("dates", inplace=True)
data.resample("M").mean()
You should check your data and types more thoroughly in case the NaN you're receiving indicates a more pressing issue.
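For instance, a quick sanity check along those lines might look like this (a sketch, assuming df is your Date/PTS frame with Date as the index):
# confirm the index is datetime and the values are numeric
print(df.index.dtype, df['PTS'].dtype)
# count missing values before resampling
print(df['PTS'].isna().sum())
# resampling assumes the index is in time order
print(df.index.is_monotonic_increasing)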
I want to calculate the duration for each ID and write it in a separate column.
ID Ques Time Expected output
----------------------------------
11 Hi 11.21 1min
11 Hello 11.22
13 hey 12.11 10mins
13 what 12.22
14 so 01.01 2mins
14 ok 01.03
----------------------------------
Tried so far -
First_last_cover = English_Logs['Date'].agg(['min','max'])
print ("First Conversation and Last Conversation of the month", First_last_cover)
Here it is step by step; the diff_mins column is the desired output:
import pandas as pd
df = pd.read_csv('C:/Users/FGB3140/Desktop/sample.csv')
df['Time_parsed'] = pd.to_datetime(df['Time'], format='%H.%M')
df['Time_parsed_shifted'] = df['Time_parsed'].shift(1)
df['diff'] = df['Time_parsed']-df['Time_parsed_shifted']
df['diff_mins'] = df['diff'].dt.seconds / 60
print(df)
Output
ID Ques Time Time_parsed Time_parsed_shifted diff \
0 11 Hi 11.21 1900-01-01 11:21:00 NaT NaT
1 11 Hello 11.22 1900-01-01 11:22:00 1900-01-01 11:21:00 00:01:00
2 13 hey 12.11 1900-01-01 12:11:00 1900-01-01 11:22:00 00:49:00
3 13 what 12.22 1900-01-01 12:22:00 1900-01-01 12:11:00 00:11:00
diff_mins
0 NaN
1 1.0
2 49.0
3 11.0
Explanation
Parse the Time column
Use .shift() to shift the column down a row - Time_parsed_shifted
Take the difference in time
Represent that difference in minutes
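Note that this takes the difference across ID boundaries too (the 49.0 between the last row of ID 11 and the first row of ID 13). If you want the duration within each ID only, a per-group variant might look like this (a sketch under the same parsing assumptions):
# first/last parsed time within each ID, then the span in minutes
df['Time_parsed'] = pd.to_datetime(df['Time'], format='%H.%M')
per_id = df.groupby('ID')['Time_parsed'].agg(['min', 'max'])
per_id['duration_mins'] = (per_id['max'] - per_id['min']).dt.total_seconds() / 60
print(per_id)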
Assumption made:
The file contains a time for each ID, and Timemin and TimeMax are calculated.
The code below explains how to calculate the difference in time and add it as a new column.
Assuming data is the DataFrame:
import pandas as pd
import dateutil.parser
Add a column with the duration by calculating the time difference:
data['duration'] = data.apply(lambda x: (dateutil.parser.parse(x['TimeMax']) - dateutil.parser.parse(x['Timemin'])).total_seconds(), axis=1)
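A vectorized equivalent without the row-wise apply (assuming the same Timemin and TimeMax string columns) could be:
# let pandas parse both columns at once, then subtract
data['duration'] = (pd.to_datetime(data['TimeMax']) - pd.to_datetime(data['Timemin'])).dt.total_seconds()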
Sorry if this seems like a stupid question, but I have a dataset which looks like this:
type time latitude longitude altitude (m) speed (km/h) name desc currentdistance timeelapsed
T 2017-10-07 10:44:48 28.750766667 77.088805000 783.5 0.0 2017-10-07_10-44-48 0.0 00:00:00
T 2017-10-07 10:44:58 28.752345000 77.087840000 853.5 7.8 198.70532 00:00:10
T 2017-10-07 10:45:00 28.752501667 77.087705000 854.5 7.7 220.53915 00:00:12
I'm not exactly sure how to approach this. Calculating acceleration requires taking the difference of speed and time; any suggestions on what I could try?
Thanks in advance.
Assuming your data was loaded from a CSV as follows:
type,time,latitude,longitude,altitude (m),speed (km/h),name,desc,currentdistance,timeelapsed
T,2017-10-07 10:44:48,28.750766667,77.088805000,783.5,0.0,2017-10-07_10-44-48,,0.0,00:00:00
T,2017-10-07 10:44:58,28.752345000,77.087840000,853.5,7.8,,,198.70532,00:00:10
T,2017-10-07 10:45:00,28.752501667,77.087705000,854.5,7.7,,,220.53915,00:00:12
The time column is converted to a datetime object, and the timeelapsed column is converted into seconds. From this you could add an acceleration column by
calculating the difference in speed (km/h) between each row and dividing by the difference in time between each row as follows:
from datetime import datetime
import pandas as pd
import numpy as np
df = pd.read_csv('input.csv', parse_dates=['time'], dtype={'name':str, 'desc':str})
df['timeelapsed'] = (pd.to_datetime(df['timeelapsed'], format='%H:%M:%S') - datetime(1900, 1, 1)).dt.total_seconds()
df['acceleration'] = (df['speed (km/h)'] - df['speed (km/h)'].shift(1)) / (df['timeelapsed'] - df['timeelapsed'].shift(1))
print(df)
Giving you:
type time latitude longitude altitude (m) speed (km/h) name desc currentdistance timeelapsed acceleration
0 T 2017-10-07 10:44:48 28.750767 77.088805 783.5 0.0 2017-10-07_10-44-48 NaN 0.00000 0.0 NaN
1 T 2017-10-07 10:44:58 28.752345 77.087840 853.5 7.8 NaN NaN 198.70532 10.0 0.78
2 T 2017-10-07 10:45:00 28.752502 77.087705 854.5 7.7 NaN NaN 220.53915 12.0 -0.05
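Note that the resulting acceleration is in (km/h) per second. If you would rather have m/s², you could convert the speed first, for example (a small sketch on top of the df above):
# km/h -> m/s, then differentiate over elapsed seconds
df['speed_ms'] = df['speed (km/h)'] * 1000 / 3600
df['acceleration_ms2'] = df['speed_ms'].diff() / df['timeelapsed'].diff()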
I have a CSV file, and hence a list or dataframe, that contains start and end dates of visits to a campsite.
start_date end_date
0 2016-01-21 2016-01-24
1 2016-01-28 2016-01-29
2 2016-02-02 2016-02-10
3 2016-02-08 2016-02-12
...
I would like to build a dataframe with a row for each day in the time period, with a column for cumulative visitors, a column for the number of visitors resident on that day, and a cumulative sum of visitor days.
I currently have some hacky code that reads the visitor data into an ordinary Python list visitor_array and creates another list year_array with an entry for each date in the period/year. It then loops over each date in year_array, with an inner loop over visitor_array, and appends to the current element of year_array a count of new visitors and the number of resident visitors on that day.
temp_day = datetime.date(2016,1,1)
year_array = [[temp_day + datetime.timedelta(days=d)] for d in range(365)]
for day in year_array:
    new_visitors = 0
    occupancy = 0
    for visitor in visitor_array:
        # count a new visitor arriving on this day
        if visitor[0] == day[0]:
            new_visitors += 1
        # count a visitor as resident if this day falls within their stay
        if (visitor[0] <= day[0]) and (day[0] <= visitor[1]):
            occupancy += 1
    day.append(new_visitors)
    day.append(occupancy)
I then convert year_array into a pandas dataframe, create some cumsum columns and get busy plotting etc etc
Is there a more elegant pythonic/pandasic way of doing this all within pandas?
Considering df the dataframe with start/end values and d the final dataframe, I would do something like this:
Code:
import numpy as np
import pandas as pd
import datetime
# ---- Create df sample
df = pd.DataFrame([['21/01/2016','24/01/2016'],
['28/01/2016','29/01/2016'],
['02/02/2016','10/02/2016'],
['08/02/2016','12/02/2016']], columns=['start','end'] )
df['start'] = pd.to_datetime(df['start'])
df['end'] = pd.to_datetime(df['end'])
# ---- Create day index
temp_day = datetime.date(2016,1,1)
index = [(temp_day + datetime.timedelta(days=d)) for d in range(365)]
# ---- Create empty result df
# initialize df, set days as datetime in index
d = pd.DataFrame(np.zeros((365,3)),
index=pd.to_datetime(index),
columns=['new_visitor','occupancy','occupied_day'])
# ---- Iterate over df to fill d (final df)
for i, row in df.iterrows():
    # add 1 on the first day for a new visitor
    d.loc[row.start,'new_visitor'] += 1
    # flag the day as occupied if any visitor is present between start and end
    d.loc[row.start:row.end,'occupied_day'] = 1
    # add 1 to the visitor occupancy for these days
    d.loc[row.start:row.end,'occupancy'] += 1
# cumulated days = sum of occupied days
d['cumul_days'] = d.occupied_day.cumsum()
# cumulated visitors = sum of occupancy
d['cumul_visitors'] = d.occupancy.cumsum()
An extract of the resulting output, print(d.loc['2016-01-21':'2016-01-29']):
index new_visitor occupancy occupied_day cumul_days cumul_visitors
2016-01-21 1.0 1.0 1.0 1.0 1.0
2016-01-22 0.0 1.0 1.0 2.0 2.0
2016-01-23 0.0 1.0 1.0 3.0 3.0
2016-01-24 0.0 1.0 1.0 4.0 4.0
2016-01-25 0.0 0.0 0.0 4.0 4.0
2016-01-26 0.0 0.0 0.0 4.0 4.0
2016-01-27 0.0 0.0 0.0 4.0 4.0
2016-01-28 1.0 1.0 1.0 5.0 5.0
2016-01-29 0.0 1.0 1.0 6.0 6.0
Hope this code helps!
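For larger inputs you could avoid the explicit loop by expanding each stay into its individual days; a minimal sketch of that idea (same df as above, using Series.explode, available since pandas 0.25):
# one entry per occupied day for every visit, then count per day
days = df.apply(lambda r: pd.date_range(r['start'], r['end']), axis=1).explode()
occupancy = days.value_counts().sort_index()
new_visitors = df['start'].value_counts().sort_index()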
I have a dataframe (df) where column A is the number of drug units dosed at the time point given by Timestamp. I want to fill the missing values (NaN) with the drug concentration implied by the half-life of the drug (180 mins). I am struggling with the code in pandas and would really appreciate help and insight. Thanks in advance.
df
A
Timestamp
1991-04-21 09:09:00 9.0
1991-04-21 3:00:00 NaN
1991-04-21 9:00:00 NaN
1991-04-22 07:35:00 10.0
1991-04-22 13:40:00 NaN
1991-04-22 16:56:00 NaN
Given that the half-life of the drug is 180 mins, I wanted to fillna(values) as a function of the time elapsed and the half-life of the drug, something like:
Timestamp A
1991-04-21 09:00:00 9.0
1991-04-21 3:00:00 ~2.25
1991-04-21 9:00:00 ~0.55
1991-04-22 07:35:00 10.0
1991-04-22 13:40:00 ~2.5
1991-04-22 16:56:00 ~0.75
Your timestamps are not sorted and I'm assuming this was a typo. I fixed it below.
import pandas as pd
import numpy as np
from io import StringIO
text = """TimeStamp A
1991-04-21 09:09:00 9.0
1991-04-21 13:00:00 NaN
1991-04-21 19:00:00 NaN
1991-04-22 07:35:00 10.0
1991-04-22 13:40:00 NaN
1991-04-22 16:56:00 NaN """
df = pd.read_csv(StringIO(text), sep='\s{2,}', engine='python', parse_dates=[0])
This is the magic code.
# half-life of 180 minutes is 10,800 seconds
# we need to calculate lamda (intentionally mis-spelled)
lamda = 10800 / np.log(2)
# returns time difference for each element
# relative to first element
def time_diff(x):
return x - x.iloc[0]
# create partition of non-nulls with subsequent nulls
partition = df.A.notnull().cumsum()
# calculate time differences in seconds for each
# element relative to most recent non-null observation
# use .dt accessor and method .total_seconds()
tdiffs = df.TimeStamp.groupby(partition).apply(time_diff).dt.total_seconds()
# apply exponential decay
decay = np.exp(-tdiffs / lamda)
# finally, forward fill the observations and multiply by decay
decay * df.A.ffill()
0 9.000000
1 3.697606
2 0.924402
3 10.000000
4 2.452325
5 1.152895
dtype: float64
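To put the filled values back into the frame, you could then assign the result, e.g.:
df['A'] = decay * df.A.ffill()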