Python: Working with columns inside a pandas DataFrame

Good evening,
is it possible to do a calculation on, say, two columns of a dataframe and add a third column holding the result?
Dataframe (original):
name time_a time_b
name_a 08:00:00 09:00:00
name_b 07:45:00 08:15:00
name_c 07:00:00 08:10:00
name_d 06:00:00 10:00:00
Or to be specific...is it possible to obtain the difference of two times (time_b - time_a) and create a
new column (time_c) at the end of the dataframe?
Dataframe (new):
name time_a time_b time_c
name_a 08:00:00 09:00:00 01:00:00
name_b 07:45:00 08:15:00 00:30:00
name_c 07:00:00 08:10:00 01:10:00
name_d 06:00:00 10:00:00 04:00:00
Thanks and a good night!

If your columns are in datetime or timedelta format:
# New column is a timedelta object
df["time_c"] = (df["time_b"] - df["time_a"])
If your columns are in datetime.time format (which it appears they are):
import datetime

def time_diff(time_1, time_2):
    """Return the difference between time_1 and time_2 (time_2 - time_1)."""
    # Attach both times to the same date so that they can be subtracted
    now = datetime.datetime.now()
    time_1 = datetime.datetime.combine(now, time_1)
    time_2 = datetime.datetime.combine(now, time_2)
    return time_2 - time_1
# Apply the function
df["time_c"] = df[["time_a","time_b"]].apply(lambda arr: time_diff(*arr), axis=1)
Alternatively, you can convert to a timedelta by first converting to a string:
df["time_a"]=pd.to_timedelta(df["time_a"].astype(str))
df["time_b"]=pd.to_timedelta(df["time_b"].astype(str))
df["time_c"] = df["time_b"] - df["time_a"]
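For reference, here is a minimal self-contained sketch of the string-conversion route. It rebuilds the table from the question and assumes the times arrive as plain "HH:MM:SS" strings (an assumption; the question does not say which dtype the columns have).

```python
import pandas as pd

# Sample data copied from the question; times are assumed to be "HH:MM:SS" strings.
df = pd.DataFrame({
    "name": ["name_a", "name_b", "name_c", "name_d"],
    "time_a": ["08:00:00", "07:45:00", "07:00:00", "06:00:00"],
    "time_b": ["09:00:00", "08:15:00", "08:10:00", "10:00:00"],
})

# Parse each column as an offset from midnight, then subtract element-wise.
df["time_c"] = pd.to_timedelta(df["time_b"]) - pd.to_timedelta(df["time_a"])

print(df)
```

The new column has dtype timedelta64[ns], so further arithmetic (sums, means, comparisons) works on it directly.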

Related

Dataframe from Series grouped by weekday and hour of day

I have a Series with a DatetimeIndex, as such :
time my_values
2017-12-20 09:00:00 0.005611
2017-12-20 10:00:00 -0.004704
2017-12-20 11:00:00 0.002980
2017-12-20 12:00:00 0.001497
...
2021-08-20 13:00:00 -0.001084
2021-08-20 14:00:00 -0.001608
2021-08-20 15:00:00 -0.002182
2021-08-20 16:00:00 -0.012891
2021-08-20 17:00:00 0.002711
I would like to create a dataframe of average values with the weekdays as columns names and hour of the day as index, resulting in this :
hour Monday Tuesday ... Sunday
0 0.005611 -0.001083 -0.003467
1 -0.004704 0.003362 -0.002357
2 0.002980 0.019443 0.009814
3 0.001497 -0.002967 -0.003466
...
19 -0.001084 0.009822 0.003362
20 -0.001608 -0.002967 -0.003567
21 -0.002182 0.035600 -0.003865
22 -0.012891 0.002945 -0.002345
23 0.002711 -0.002458 0.006467
How can I do this in Python?
You can do it as follows:
# Coerce time to datetime
df['time'] = pd.to_datetime(df['time'])
# Extract day and hour
df = df.assign(day=df['time'].dt.strftime('%A'), hour=df['time'].dt.hour)
# Pivot
pd.pivot_table(df, values='my_values', index=['hour'],
               columns=['day'], aggfunc='mean')
Since you asked for a solution that returns the average values, I propose this groupby solution
df["weekday"] = df.time.dt.strftime('%A')
df["hour"] = df.time.dt.strftime('%H')
df = df.drop(["time"], axis=1)
# calculate averages by weekday and hour
df2 = df.groupby(["hour", "weekday"]).mean()
# put it in the right format
df2.unstack()
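Both answers leave the weekday columns in alphabetical order. The sketch below groups directly on the DatetimeIndex of a Series and then puts the columns back into calendar order. The data is made up (two weeks of hourly values), and the names s, table and weekdays are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series covering exactly two weeks, so every
# (hour, weekday) cell averages two observations.
idx = pd.date_range("2021-01-04", periods=24 * 14, freq="h")  # 2021-01-04 is a Monday
s = pd.Series(np.arange(len(idx), dtype=float), index=idx, name="my_values")

# Group on hour of day and weekday name, average, and move weekdays to columns.
table = s.groupby([s.index.hour, s.index.day_name()]).mean().unstack()

# unstack() sorts the weekday names alphabetically; restore calendar order.
weekdays = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
table = table.reindex(columns=weekdays)
table.index.name = "hour"

print(table.head())
```

Weekdays that never occur in the data would show up as all-NaN columns after the reindex, which is usually what you want for a fixed seven-column layout.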

How to subtract timestamps pairwise, two rows at a time - Pandas Python

I would like to subtract datetime values in pandas, but taking the rows two at a time (second minus first, fourth minus third, and so on). I don't know which function to use.
Timestamp
2020-11-26 20:00:00
2020-11-26 21:00:00
2020-11-26 22:00:00
2020-11-26 23:30:00
Explanation:
(2020-11-26 21:00:00) - (2020-11-26 20:00:00)
(2020-11-26 23:30:00) - (2020-11-26 22:00:00)
The result must be:
01:00:00
01:30:00
First, check that the column has a datetime dtype.
If not, convert it with pd.to_datetime():
demo = pd.DataFrame(columns=['Timestamps'])
demotime = ['20:00:00','21:00:00','22:00:00','23:30:00']
demo['Timestamps'] = demotime
demo['Timestamps'] = pd.to_datetime(demo['Timestamps'])
Your dataframe would look like:
Timestamps
0 2020-11-29 20:00:00
1 2020-11-29 21:00:00
2 2020-11-29 22:00:00
3 2020-11-29 23:30:00
After that you can use either a for loop or a while loop over every second row position i (0, 2, ...) and in it just do:
demo.iloc[i+1,0]-demo.iloc[i,0]
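Spelled out, the loop could look like the sketch below. It uses the timestamps from the question and assumes an even number of rows; the list name diffs is only illustrative.

```python
import pandas as pd

demo = pd.DataFrame({
    "Timestamps": pd.to_datetime([
        "2020-11-26 20:00:00",
        "2020-11-26 21:00:00",
        "2020-11-26 22:00:00",
        "2020-11-26 23:30:00",
    ])
})

# Step through the rows two at a time: (0, 1), (2, 3), ...
diffs = []
for i in range(0, len(demo) - 1, 2):
    diffs.append(demo.iloc[i + 1, 0] - demo.iloc[i, 0])

print(diffs)
```

With an odd number of rows the last row has no partner and is simply skipped.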
IIUC, you want to iterate on chunks of two and find the difference, one approach is to:
res = df.groupby(np.arange(len(df)) // 2).diff().dropna()
print(res)
Output
Timestamp
1 0 days 01:00:00
3 0 days 01:30:00

How to find the min value between two different datetime using Pandas?

(Not duplicate / my question is entirely different)
My dataframe looks like this:
# [df2] is day based
time time2
2017-01-01, 2017-01-01 00:12:00
2017-01-02, 2017-01-02 03:15:00
2017-01-03, 2017-01-03 01:25:00
2017-01-04, 2017-01-04 04:12:00
2017-01-05, 2017-01-05 00:45:00
....
# [df] is minute based
time value
2017-01-01 00:01:00, 0.1232
2017-01-01 00:02:00, 0.1232
2017-01-01 00:03:00, 0.1232
2017-01-01 00:04:00, 0.1232
2017-01-01 00:05:00, 0.1232
....
I want to create a new column called time_val_min in [df2] that holds the min value from [df] within the range specified by df2['time'] and df2['time2'].
What did I do?
I did df2['time_val_min'] = df[df['time'].dt.hour.between(df2['time'], df2['time'])].min() but it does not work.
Could you please let me know how to fix it?
You can merge the two data frames on the date and then filter on the time:
# create the date from the time column
df['date'] = df['time'].dt.normalize()
# merge
new_df = (df.merge(df2, left_on='date',   # left on date
                   right_on='time',       # right on time, if time is purely beginning of days
                   how='right',
                   suffixes=['','_y'])
            .query('time < time2')
            .groupby('date')
            ['time'].min()
            .to_frame(name='time_val_min')
            .merge(df2, right_on='time', left_index=True)
         )
Output:
time_val_min time time2
0 2017-01-01 00:01:00 2017-01-01 2017-01-01 00:12:00
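The code above returns the earliest time inside each window. If what is wanted is the smallest reading in the value column inside each window (one possible reading of the question, so treat it as an assumption), a similar merge can be sketched as follows. The readings are invented for illustration.

```python
import pandas as pd

# Day-based frame: one window [time, time2] per day (dates from the question).
df2 = pd.DataFrame({
    "time": pd.to_datetime(["2017-01-01", "2017-01-02"]),
    "time2": pd.to_datetime(["2017-01-01 00:12:00", "2017-01-02 03:15:00"]),
})

# Minute-based frame; these readings are made up.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2017-01-01 00:01:00", "2017-01-01 00:05:00", "2017-01-01 00:30:00",
        "2017-01-02 00:10:00", "2017-01-02 03:00:00",
    ]),
    "value": [0.5, 0.2, 0.1, 0.9, 0.7],
})

# Attach each reading to its day's window, keep the readings that fall
# inside the window, then take the smallest value per day.
merged = df.assign(date=df["time"].dt.normalize()).merge(
    df2.rename(columns={"time": "date"}), on="date"
)
inside = merged[merged["time"] <= merged["time2"]]
mins = inside.groupby("date")["value"].min()

df2["time_val_min"] = df2["time"].map(mins)
print(df2)
```

Days with no reading inside their window end up with NaN in time_val_min.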

Hourly time series in minutes between two timestamps using Pandas

I have a range of timestamps with start time and end time. I would like to generate the number of minutes per hour between the two timestamps:
import pandas as pd
start_time = pd.to_datetime('2013-03-26 21:49:08',infer_datetime_format=True)
end_time = pd.to_datetime('2013-03-27 05:21:00', infer_datetime_format=True)
pd.date_range(start_time, end_time, freq='h')
which gives:
DatetimeIndex(['2013-03-26 21:49:08', '2013-03-26 22:49:08',
'2013-03-26 23:49:08', '2013-03-27 00:49:08',
'2013-03-27 01:49:08', '2013-03-27 02:49:08',
'2013-03-27 03:49:08', '2013-03-27 04:49:08'],
dtype='datetime64[ns]', freq='H')
Sample result: I would like to compute the number of minutes bounded by the hour between the start and end times, like below:
2013-03-26 21:00:00' - 10m 52secs
2013-03-26 22:00:00' - 60 m
2013-03-26 23:00:00' - 60 m
2013-03-27 05:00:00' - 21 m
I have looked at pandas resample, but not exactly sure how to achieve this. Any direction is appreciated.
Construct two Series corresponding to the start and end time of each hour. Use clip(lower=...) and clip(upper=...) to restrict them to be within your desired timespan, then subtract. (Older pandas versions offered clip_lower and clip_upper for this; they were removed in pandas 1.0.)
# hourly range, floored to the nearest hour
rng = pd.date_range(start_time.floor('h'), end_time.floor('h'), freq='h')
# get the left and right endpoints for each hour
# clipped to be inclusive of [start_time, end_time]
left = pd.Series(rng, index=rng).clip(lower=start_time)
right = pd.Series(rng + pd.Timedelta(hours=1), index=rng).clip(upper=end_time)
# construct a series of the lengths
s = right - left
The resulting output:
2013-03-26 21:00:00 00:10:52
2013-03-26 22:00:00 01:00:00
2013-03-26 23:00:00 01:00:00
2013-03-27 00:00:00 01:00:00
2013-03-27 01:00:00 01:00:00
2013-03-27 02:00:00 01:00:00
2013-03-27 03:00:00 01:00:00
2013-03-27 04:00:00 01:00:00
2013-03-27 05:00:00 00:21:00
Freq: H, dtype: timedelta64[ns]
Utilizing datetime.timedelta() in some sort of for loop seems like it's what you're looking for.
https://docs.python.org/2/library/datetime.html#datetime.timedelta
It seems like this might be a viable solution:
import pandas as pd
import datetime as dt
def bounded_min(t, range_time):
    """ For a given timestamp t and considered time interval range_time,
    return the desired bounded value in minutes and seconds"""
    # min() takes care of the end of the time interval,
    # max() takes care of the beginning of the interval
    s = (min(t + dt.timedelta(hours=1), range_time.max()) -
         max(t, range_time.min())).total_seconds()
    if s%60:
        return "%dm %dsecs" % (s/60, s%60)
    else:
        return "%dm" % (s/60)
start_time = pd.to_datetime('2013-03-26 21:49:08',infer_datetime_format=True)
end_time = pd.to_datetime('2013-03-27 05:21:00', infer_datetime_format=True)
range_time = pd.date_range(start_time, end_time, freq='h')
# Include the end of the time range using the union() trick, as described at:
# https://stackoverflow.com/questions/37890391/how-to-include-end-date-in-pandas-date-range-method
range_time = range_time.union([end_time])
# This is essentially timestamps for beginnings of hours
index_time = pd.Series(range_time).apply(lambda x: dt.datetime(year=x.year,
                                                               month=x.month,
                                                               day=x.day,
                                                               hour=x.hour,
                                                               minute=0,
                                                               second=0))
bounded_mins = index_time.apply(lambda x: bounded_min(x, range_time))
# Put timestamps and values together
bounded_df = pd.DataFrame(bounded_mins, columns=["Bounded Mins"]).set_index(index_time)
print(bounded_df)
Gotta love the powerful lambdas:). Maybe there is a simpler way to do it though.
Output:
Bounded Mins
2013-03-26 21:00:00 10m 52secs
2013-03-26 22:00:00 60m
2013-03-26 23:00:00 60m
2013-03-27 00:00:00 60m
2013-03-27 01:00:00 60m
2013-03-27 02:00:00 60m
2013-03-27 03:00:00 60m
2013-03-27 04:00:00 60m
2013-03-27 05:00:00 21m
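On the "simpler way" point: the same clipping idea also works with the standard library alone. The following is only a sketch; the function name minutes_per_hour is made up for this example.

```python
from datetime import datetime, timedelta

def minutes_per_hour(start, end):
    """Map each clock hour touched by [start, end] to the time spent in it."""
    result = {}
    hour = start.replace(minute=0, second=0, microsecond=0)
    while hour < end:
        next_hour = hour + timedelta(hours=1)
        # Clip the hour's boundaries to the requested interval.
        result[hour] = min(next_hour, end) - max(hour, start)
        hour = next_hour
    return result

start_time = datetime(2013, 3, 26, 21, 49, 8)
end_time = datetime(2013, 3, 27, 5, 21, 0)
per_hour = minutes_per_hour(start_time, end_time)
for hour, span in per_hour.items():
    print(hour, span)
```

Each value is a datetime.timedelta, so span.total_seconds() / 60 gives the minutes as a number if you need one.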

How to resample data in a single dataframe within 3 distinct groups

I've got a dataframe and want to resample certain columns (as hourly sums and means from 10-minutely data) WITHIN the 3 different 'users' that exist in the dataset.
A normal resample would use code like:
import pandas as pd
import numpy as np
df = pd.read_csv('example.csv')
df['Datetime'] = pd.to_datetime(df['date_datetime/_source'] + ' ' + df['time']) #create datetime stamp
df.set_index(df['Datetime'], inplace = True)
df = df.resample('1H').agg({'energy_kwh': np.sum, 'average_w': np.mean, 'norm_average_kw/kw': np.mean, 'temperature_degc': np.mean, 'voltage_v': np.mean})
df
To get a result like (please forgive the column formatting, I have no idea how to paste this properly to make it look nice):
energy_kwh norm_average_kw/kw voltage_v temperature_degc average_w
Datetime
2013-04-30 06:00:00 0.027 0.007333 266.333333 4.366667 30.000000
2013-04-30 07:00:00 1.250 0.052333 298.666667 5.300000 192.500000
2013-04-30 08:00:00 5.287 0.121417 302.333333 7.516667 444.000000
2013-04-30 09:00:00 12.449 0.201000 297.500000 9.683333 726.000000
2013-04-30 10:00:00 26.101 0.396417 288.166667 11.150000 1450.000000
2013-04-30 11:00:00 45.396 0.460250 282.333333 12.183333 1672.500000
2013-04-30 12:00:00 64.731 0.440833 276.166667 13.550000 1541.000000
2013-04-30 13:00:00 87.095 0.562750 284.833333 13.733333 2084.500000
However, in the original CSV, there is a column containing URLs - in the dataset of 100,000 rows, there are 3 different URLs (effectively IDs). I want to have each resampled individually rather than having a 'lump' resample from all (e.g. 9.00 AM on 2014-01-01 would have data for all 3 users, but each should have its own hourly sums and means).
I hope this makes sense - please let me know if I need to clarify anything.
FYI, I tried using the advice in the following 2 posts but to no avail:
Resampling a multi-index DataFrame
Resampling Within a Pandas MultiIndex
Thanks in advance
You can resample a groupby object, groupby-ed by URLs, in this minimal example:
In [157]:
df=pd.DataFrame({'Val': np.random.random(100)})
df['Datetime'] = pd.date_range('2001-01-01', periods=100, freq='5H') #create random dataset
df.set_index(df['Datetime'], inplace = True)
del df['Datetime']
df['Location']=np.tile(['l0', 'l1', 'l2', 'l3', 'l4'], 20)
In [158]:
# resample(..., how=...) is no longer supported; select the column and aggregate instead
print(df.groupby('Location').resample('10D')['Val'].mean())
Val
Location Datetime
l0 2001-01-01 00:00:00 0.334183
2001-01-11 00:00:00 0.584260
l1 2001-01-01 05:00:00 0.288290
2001-01-11 05:00:00 0.470140
l2 2001-01-01 10:00:00 0.381273
2001-01-11 10:00:00 0.461684
l3 2001-01-01 15:00:00 0.703523
2001-01-11 15:00:00 0.386858
l4 2001-01-01 20:00:00 0.448857
2001-01-11 20:00:00 0.310914
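For the case in the question (a sum for some columns and a mean for others, per user), one option is to group on the ID column together with an hourly pd.Grouper. This is a sketch on made-up data; the column name url and the two user labels are assumptions that stand in for the real CSV.

```python
import pandas as pd

# Hypothetical 10-minute readings for two users; "url", "energy_kwh" and
# "average_w" stand in for the asker's real columns.
idx = pd.date_range("2013-04-30 06:00", periods=12, freq="10min")
df = pd.concat([
    pd.DataFrame({"url": "user_a", "energy_kwh": 1.0, "average_w": 30.0}, index=idx),
    pd.DataFrame({"url": "user_b", "energy_kwh": 2.0, "average_w": 60.0}, index=idx),
]).sort_index()
df.index.name = "Datetime"

# Group by user and by hourly bins of the DatetimeIndex, then aggregate
# each column with its own function.
hourly = df.groupby(["url", pd.Grouper(freq="h")]).agg(
    {"energy_kwh": "sum", "average_w": "mean"}
)
print(hourly)
```

The result has a (url, Datetime) MultiIndex, so hourly.loc["user_a"] gives the hourly table for a single user.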
