Pandas: How to extract rows which are just within a time duration? - python

I have a data-frame like this.
                       value  estimated
dttm_timezone
2011-12-31 20:10:00  10.7891          0
2011-12-31 20:15:00  11.2060          0
2011-12-31 20:20:00  19.9975          0
2011-12-31 20:25:00  15.9975          0
2011-12-31 20:30:00  10.9975          0
2011-12-31 20:35:00  13.9975          0
2011-12-31 20:40:00  15.9975          0
2011-12-31 20:45:00  11.7891          0
2011-12-31 20:50:00  10.9975          0
2011-12-31 20:55:00  10.3933          0
Using the dttm_timezone index, I would like to extract all the rows that fall within a given day, week, or month.
I have one year of data, so if I select day as the duration I should get 365 separate days of data; if I select month, I should get 12 separate months of data.
How can I achieve this?

Let's use
import pandas as pd
import numpy as np
tidx = pd.date_range('2010-01-01', '2014-12-31', freq='H', name='dtime')
np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(len(tidx)), tidx, ['value'])
You can limit to 2010 like this (recent pandas requires .loc for partial-string indexing on a single label):
df.loc['2010']
Or
df[df.index.year == 2010]
You can limit to a specific month by:
df.loc['2010-04']
or all Aprils:
df[df.index.month == 4]
You can limit to a specific day:
df.loc['2010-04-28']
all 1:00 pm's:
df[df.index.hour == 13]
range of dates:
df.loc['2011':'2013']
or
df.loc['2011-01-01':'2013-06-30']
There are tons of ways to do this:
df.loc[(df.index.month == 11) & (df.index.hour == 22)]
The list can go on and on; please read the time series docs.
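The question also asks to pull out each day's, week's, or month's rows as separate pieces. A minimal sketch of one way to do that with pd.Grouper (an addition, not part of the original answer; 'D', 'W', and 'M' are standard pandas frequency aliases):

for period, group in df.groupby(pd.Grouper(freq='D')):
    # `group` holds only that day's rows; swap 'D' for 'W' (week)
    # or 'M' (month end; use 'ME' in pandas >= 2.2)
    print(period, len(group))

# or collect the pieces into a dict keyed by period label
monthly = {period: group for period, group in df.groupby(pd.Grouper(freq='M'))}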

Related

Pandas - Add mean, max, min as row in dataframe

The dataframe eventually converts to Excel...
I am trying to create an additional row with the avg and max above each column.
I do not want to disturb the original headers for the actual data.
I don't want to hard-code the column names as these will change, so I need something abstract. I attempted to create a max row but failed. I need the max above the column headers.
Try this. I don't know how to create rows above the dataframe, but I believe that in the end it might be a good solution:
import pandas as pd

df = {
    'date and time': ['2022-03-01', '2022-03-02', '2022-03-03', '2022-03-04'],
    '<PowerAC--->': [40, 20, 9, 12]
}
df = pd.DataFrame(df)
cols = ['<PowerAC--->']
agg = df[cols].agg(['mean', 'max'])
out = pd.concat([df, agg])
print(out)
A one-liner method which also removes the "NaN" values to make it visually better (I'm a bit OCD ;)). Note that DataFrame.append was removed in pandas 2.0, so pd.concat is the safe equivalent:
pd.concat([df, df.agg({'<PowerAC--->': ['mean', 'max']})]).fillna('')
I would say it's a good idea to keep your data separated from the reporting on it - I don't really see the logic for an "additional row above the column".
I would generate statistics for the overall data as a separate dataframe.
import pandas as pd
import numpy as np
np.random.seed(1)
t = pd.date_range(start='2022-05-31', end='2022-06-07')
x = np.random.rand(len(t))
df = pd.DataFrame({'date': t, 'data': x})
print(df)
# The 'numeric_only=False' behaviour will become default in a future version of pandas
d_mean = df.mean(numeric_only=False)
d_max = df.max()
# We need to transpose this, as the `d_mean` and `d_max` are Series (columns), and we want them as rows
df_stats = pd.DataFrame({'mean': d_mean, 'max':d_max}).transpose()
print(df_stats)
df output:
date data
0 2022-05-31 0.417022
1 2022-06-01 0.720324
2 2022-06-02 0.000114
3 2022-06-03 0.302333
4 2022-06-04 0.146756
5 2022-06-05 0.092339
6 2022-06-06 0.186260
7 2022-06-07 0.345561
df_stats output:
date data
mean 2022-06-03 12:00:00 0.276339
max 2022-06-07 00:00:00 0.720324
You could add this and the dataframe together with
pd.concat([df_stats, df])
which looks like
date data
mean 2022-06-03 12:00:00 0.276339
max 2022-06-07 00:00:00 0.720324
0 2022-05-31 00:00:00 0.417022
1 2022-06-01 00:00:00 0.720324
2 2022-06-02 00:00:00 0.000114
3 2022-06-03 00:00:00 0.302333
4 2022-06-04 00:00:00 0.146756
5 2022-06-05 00:00:00 0.092339
6 2022-06-06 00:00:00 0.18626
7 2022-06-07 00:00:00 0.345561
but I would keep them separate unless you've got a very good reason to.
There may be some way which makes sense using a multi-index, but that's a bit beyond me, and probably beyond the scope of this question.
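For what it's worth, here is a minimal sketch of that multi-index idea (my addition, under the assumption that keeping both blocks in one frame is wanted at all): pd.concat takes a keys argument that labels each piece, so the statistics and the data live together but stay clearly separated:

combined = pd.concat([df_stats, df], keys=['stats', 'data'])
print(combined.loc['stats'])  # just the summary rows
print(combined.loc['data'])   # just the original data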
Edit: If you don't infer any meaning from the max and mean of the date column but still want something compatible with that column (i.e. still a datetime but effectively null), you could replace it with np.datetime64('NaT') (NaT is similar to NaN, but "not a time"):
df_stats['date'] = np.datetime64('NaT')
print(pd.concat([df_stats, df]).head())
output:
date data
mean NaT 0.276339
max NaT 0.720324
0 2022-05-31 0.417022
1 2022-06-01 0.720324
2 2022-06-02 0.000114

How do I groupby activity with hours data type in pandas

I have been trying to sum the hours by activity in a dataframe, but it didn't work.
The code:
import pandas as pd
fileurl = r'https://docs.google.com/spreadsheets/d/1WuvvsZCfbcioYLvwwHuSunUbs4tjvv05/edit?usp=sharing&ouid=105286407332351152540&rtpof=true&sd=true'
df = pd.read_excel(fileurl, header=0)
df.groupby('Activity').sum()
Excel link: https://docs.google.com/spreadsheets/d/1WuvvsZCfbcioYLvwwHuSunUbs4tjvv05/edit?usp=sharing&ouid=105286407332351152540&rtpof=true&sd=true
You have to force the hours column to be strings, otherwise you will get datetime.time instances:
df = pd.read_excel(fileurl, header=0, dtype={'hours': str})
out = (df.assign(hours=pd.to_timedelta(df['hours']))
         .groupby('Activity', as_index=False)['hours'].sum())
print(out)
# Output
Activity hours
0 bushwalking 0 days 04:45:00
1 cycling 0 days 11:30:00
2 football 0 days 03:42:00
3 gym 0 days 07:00:00
4 running 0 days 14:00:00
5 swimming 0 days 13:15:00
6 walking 0 days 04:00:00
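If you'd rather report plain numbers than Timedelta objects, a possible follow-up (my addition, not part of the answer above) converts the sums to fractional hours:

# Convert the summed Timedeltas to float hours, e.g. 0 days 04:45:00 -> 4.75
out['hours_float'] = out['hours'].dt.total_seconds() / 3600
print(out)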

Dataframe from Series grouped by weekday and hour of day

I have a Series with a DatetimeIndex, as such:
time my_values
2017-12-20 09:00:00 0.005611
2017-12-20 10:00:00 -0.004704
2017-12-20 11:00:00 0.002980
2017-12-20 12:00:00 0.001497
...
2021-08-20 13:00:00 -0.001084
2021-08-20 14:00:00 -0.001608
2021-08-20 15:00:00 -0.002182
2021-08-20 16:00:00 -0.012891
2021-08-20 17:00:00 0.002711
I would like to create a dataframe of average values with the weekdays as column names and hour of the day as index, resulting in this:
hour Monday Tuesday ... Sunday
0 0.005611 -0.001083 -0.003467
1 -0.004704 0.003362 -0.002357
2 0.002980 0.019443 0.009814
3 0.001497 -0.002967 -0.003466
...
19 -0.001084 0.009822 0.003362
20 -0.001608 -0.002967 -0.003567
21 -0.002182 0.035600 -0.003865
22 -0.012891 0.002945 -0.002345
23 0.002711 -0.002458 0.006467
How can I do this in Python?
Do as follows:
import numpy as np
import pandas as pd

# Coerce time to datetime
df['time'] = pd.to_datetime(df['time'])
# Extract the day name and the hour
df = df.assign(day=df['time'].dt.strftime('%A'), hour=df['time'].dt.hour)
# Pivot: weekday names as columns, hour of day as index, averaging my_values
pd.pivot_table(df, values='my_values', index=['hour'],
               columns=['day'], aggfunc=np.mean)
Since you asked for a solution that returns the average values, I propose this groupby solution:
df["weekday"] = df.time.dt.strftime('%A')
df["hour"] = df.time.dt.strftime('%H')
df = df.drop(["time"], axis=1)
# calculate averages by weekday and hour
df2 = df.groupby(["hour", "weekday"]).mean()
# put it in the right format
df2.unstack()
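One caveat with both answers above: groupby sorts the weekday names alphabetically (Friday first), not Monday through Sunday as the desired output shows. A small sketch to enforce calendar order, assuming df2 from the groupby above:

days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
        'Friday', 'Saturday', 'Sunday']
# unstack() yields columns like ('my_values', 'Friday'); select the
# value level, then reindex the columns into calendar order
result = df2.unstack()['my_values'].reindex(columns=days)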

Elegant way to shift multiple date columns - Pandas

I have a dataframe like as shown below
df = pd.DataFrame({'person_id': [11, 11, 11, 21, 21],
                   'offset': ['-131 days', '29 days', '142 days', '20 days', '-200 days'],
                   'date_1': ['05/29/2017', '01/21/1997', '7/27/1989', '01/01/2013', '12/31/2016'],
                   'dis_date': ['05/29/2017', '01/24/1999', '7/22/1999', '01/01/2015', '12/31/1991'],
                   'vis_date': ['05/29/2018', '01/27/1994', '7/29/2011', '01/01/2018', '12/31/2014']})
df['date_1'] = pd.to_datetime(df['date_1'])
df['dis_date'] = pd.to_datetime(df['dis_date'])
df['vis_date'] = pd.to_datetime(df['vis_date'])
I would like to shift all the dates of each subject based on his offset.
Though my code works (credit: SO), I am looking for a more elegant approach. You can see I am repeating almost the same line three times.
df['offset_to_shift'] = pd.to_timedelta(df['offset'])  # strings like '-131 days' parse directly; unit= must not be passed with string input
# I am trying to make the lines below elegant/efficient
df['shifted_date_1'] = df['date_1'] + df['offset_to_shift']
df['shifted_dis_date'] = df['dis_date'] + df['offset_to_shift']
df['shifted_vis_date'] = df['vis_date'] + df['offset_to_shift']
I expect my output to be as shown below.
Use DataFrame.add along with DataFrame.add_prefix and DataFrame.join:
cols = ['date_1', 'dis_date', 'vis_date']
df = df.join(df[cols].add(df['offset_to_shift'], axis=0).add_prefix('shifted_'))
OR, it is also possible to use pd.concat:
df = pd.concat([df, df[cols].add(df['offset_to_shift'], axis=0).add_prefix('shifted_')], axis=1)
OR, we can also directly assign the new shifted columns to the dataframe:
df[['shifted_' + col for col in cols]] = df[cols].add(df['offset_to_shift'], axis=0)
Result:
# print(df)
person_id offset date_1 dis_date vis_date offset_to_shift shifted_date_1 shifted_dis_date shifted_vis_date
0 11 -131 days 2017-05-29 2017-05-29 2018-05-29 -131 days 2017-01-18 2017-01-18 2018-01-18
1 11 29 days 1997-01-21 1999-01-24 1994-01-27 29 days 1997-02-19 1999-02-22 1994-02-25
2 11 142 days 1989-07-27 1999-07-22 2011-07-29 142 days 1989-12-16 1999-12-11 2011-12-18
3 21 20 days 2013-01-01 2015-01-01 2018-01-01 20 days 2013-01-21 2015-01-21 2018-01-21
4 21 -200 days 2016-12-31 1991-12-31 2014-12-31 -200 days 2016-06-14 1991-06-14 2014-06-14
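If even listing the columns feels repetitive, a hypothetical variant (assuming every date column contains 'date' in its name, which holds for this frame) picks them by pattern:

# Select the date columns by substring instead of listing them out
# (run this before any shifted_ columns exist, or they will match too)
date_cols = df.filter(like='date').columns
df[['shifted_' + c for c in date_cols]] = df[date_cols].add(df['offset_to_shift'], axis=0)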

Is there a Pandas function to highlight a week's 10 lowest values in a time series?

Rookie here so please excuse my question format:
I got an event time series dataset for two months (columns for "date/time" and "# of events", each row representing an hour).
I would like to highlight the 10 hours with the lowest numbers of events for each week. Is there a specific Pandas function for that? Thanks!
Let's say you have a dataframe df with column col as well as a datetime column.
You can simply sort the column with
import pandas as pd
df = pd.DataFrame({'col': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15],
                   'datetime': ['2019-01-01 00:00:00', '2015-02-01 00:00:00', '2015-03-01 00:00:00',
                                '2015-04-01 00:00:00', '2018-05-01 00:00:00', '2016-06-01 00:00:00',
                                '2017-07-01 00:00:00', '2013-08-01 00:00:00', '2015-09-01 00:00:00',
                                '2015-10-01 00:00:00', '2015-11-01 00:00:00', '2015-12-01 00:00:00',
                                '2014-01-01 00:00:00', '2020-01-01 00:00:00', '2014-01-01 00:00:00']})
df = df.sort_values('col')
df = df.iloc[0:10, :]
df
df
Output:
col datetime
0 1 2019-01-01 00:00:00
1 2 2015-02-01 00:00:00
2 3 2015-03-01 00:00:00
3 4 2015-04-01 00:00:00
4 5 2018-05-01 00:00:00
5 6 2016-06-01 00:00:00
6 7 2017-07-01 00:00:00
7 8 2013-08-01 00:00:00
8 9 2015-09-01 00:00:00
9 10 2015-10-01 00:00:00
I know there's a function called nlargest. I guess there should be an nsmallest counterpart. pandas.DataFrame.nsmallest
df.nsmallest(n=10, columns=['col'])
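To apply that per week rather than over the whole two months, a sketch combining pd.Grouper with the groupby version of nsmallest (assuming columns 'datetime' and 'col' as in the frame above):

df['datetime'] = pd.to_datetime(df['datetime'])
# For each calendar week, keep the 10 rows with the fewest events
lowest_per_week = (df.set_index('datetime')
                     .groupby(pd.Grouper(freq='W'))['col']
                     .nsmallest(10))
print(lowest_per_week)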
My bad, so your DatetimeIndex is an hourly sampling, and you need the hour(s) with the fewest events weekly.
...
Date n_events
2020-06-06 08:00:00 3
2020-06-06 09:00:00 3
2020-06-06 10:00:00 2
...
Well, I'd start by converting each hour into columns.
1. Create an hour column that holds the hour of the day:
df['hour'] = df['date'].dt.hour
2. Pivot the hour values into columns, with n_events as the values (pandas.DataFrame.pivot_table).
You'll then have one datetime index and 24 hour columns, with values denoting #events.
...
Date hour0 ... hour8 hour9 hour10 ... hour23
2020-06-06 0 3 3 2 0
...
Then you can resample it to a weekly-level aggregate using sum:
df.resample('w').sum()
The last part is a bit tricky to do on the dataframe, but fairly simple if you just need the output:
for row in df.itertuples():
    print(sorted(row[1:]))
