Create new DataFrame based on hours range of a column - python

My df is:
ordinal  id  latitude  longitude  timestamp                 epoch     day_of_week
1.0      38  44.9484   7.7728     2016-06-01 08:18:46.000   1.464769  Wednesday
2.0      38  44.9503   7.7748     2016-06-01 08:28:05.000   1.464770  Wednesday
3.0      38  44.9503   7.7748     2016-06-01 08:38:09.000   1.464770  Wednesday
I want to create new DataFrames df1, df2, df3, ... based on hour ranges:
Ex: from 2016-06-01 08:00:00.000 to 2016-06-01 09:00:00.000 (from 8 o'clock to 9 o'clock) I want to have:
1.0 38 44.9484 7.7728 2016-06-01 08:18:46.000 1.464769 Wednesday
2.0 38 44.9503 7.7748 2016-06-01 08:28:05.000 1.464770 Wednesday
I want to do this for all 24 hours. Ideally I'd like code that can be applied to the whole column at once, but doing it one hour at a time would also work.

You don't describe why you want to generate hour-specific slices of the raw data. In general, creating a separate variable for each slice (df1, df2, df3, ...) is considered bad practice and not Pythonic.
I suggest grouping your data by hour using groupby, which lets you loop through these slices; each group is itself a DataFrame.
Here's a minimal working example:
import pandas as pd
import numpy as np

N = 100
data_char = np.random.randint(0, 100, size=N)
timestamp = pd.date_range(start='2018-04-24', end='2018-04-25', periods=N)
data = {'data_char': data_char, 'timestamp': timestamp}
df = pd.DataFrame.from_dict(data)

for hour, group in df.groupby(df['timestamp'].dt.hour):
    print(hour)
    print(group)
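If you really do need one DataFrame per hour, a minimal sketch (reusing the df from the example above) collects the groups into a dict keyed by hour instead of creating 24 separate variables:
# one DataFrame per hour of day; hours with no rows simply have no key
hourly_dfs = {hour: group for hour, group in df.groupby(df['timestamp'].dt.hour)}
df_8_to_9 = hourly_dfs[8]  # all rows with timestamps from 08:00:00 to 08:59:59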

Related

Time elapsed since first log for each user

I'm trying to calculate the time difference between all the logs of a user and the first log of that same user. There are users with several logs.
The dataframe looks like this:
    ID           DATE
16  00000021601  2022-08-23 17:12:04
20  00000021601  2022-08-23 17:12:04
21  00000031313  2022-10-22 11:16:57
22  00000031313  2022-10-22 12:16:44
23  00000031313  2022-10-22 14:39:07
24  00000065137  2022-05-06 11:51:33
25  00000065137  2022-05-06 11:51:33
I know that I could do df['DELTA'] = df.groupby('ID')['DATE'].shift(-1) - df['DATE'] to get the difference between consecutive dates for each user, but since something like iat[0] doesn't work in this case, I don't know how to get the difference relative to the first date.
You can try this code:
import pandas as pd

dates = ['2022-08-23 17:12:04',
         '2022-08-23 17:12:04',
         '2022-10-22 11:16:57',
         '2022-10-22 12:16:44',
         '2022-10-22 14:39:07',
         '2022-05-06 11:51:33',
         '2022-05-06 11:51:33']
ids = [1, 1, 1, 2, 2, 2, 2]

df = pd.DataFrame({'id': ids, 'dates': dates})
df['dates'] = pd.to_datetime(df['dates'])

# subtract each group's first date from every date in that group
df.groupby('id').apply(lambda x: x['dates'] - x['dates'].iloc[0])
Out:
id
1   0      0 days 00:00:00
    1      0 days 00:00:00
    2     59 days 18:04:53
2   3      0 days 00:00:00
    4      0 days 02:22:23
    5   -170 days +23:34:49
    6   -170 days +23:34:49
Name: dates, dtype: timedelta64[ns]
If your dataframe is large and apply takes a long time, you can try parallel-pandas. It's very simple:
import pandas as pd
from parallel_pandas import ParallelPandas

ParallelPandas.initialize(n_cpu=8)

dates = ['2022-08-23 17:12:04',
         '2022-08-23 17:12:04',
         '2022-10-22 11:16:57',
         '2022-10-22 12:16:44',
         '2022-10-22 14:39:07',
         '2022-05-06 11:51:33',
         '2022-05-06 11:51:33']
ids = [1, 1, 1, 2, 2, 2, 2]

df = pd.DataFrame({'id': ids, 'dates': dates})
df['dates'] = pd.to_datetime(df['dates'])

# p_apply is the parallel analogue of the apply method
df.groupby('id').p_apply(lambda x: x['dates'] - x['dates'].iloc[0])
It can be 5-10 times faster.
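Alternatively, you can avoid apply entirely with a vectorized groupby.transform; a minimal sketch, reusing the df built above, where 'first' broadcasts each group's first date (in the original row order) back to all of that group's rows:
# equivalent to the iloc[0]-based lambda above, but vectorized
df['DELTA'] = df['dates'] - df.groupby('id')['dates'].transform('first')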

Selecting multiple ranges of dates from dataframe

I have a dataframe with dates and prices (as below).
df = pd.DataFrame({'date': ['2015-01-01', '2015-01-02', '2015-01-03',
                            '2016-01-01', '2016-01-02', '2016-01-03',
                            '2017-01-01', '2017-01-02', '2017-01-03',
                            '2018-01-01', '2018-01-02', '2018-01-03'],
                   'price': [78, 87, 52, 94, 55, 45, 68, 76, 65, 75, 78, 21]})
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
select_dates = df.set_index(['date'])
I want to select a range of specific dates to add to a new dataframe. For example, I would like to select prices for the first quarter of 2015 and the first quarter of 2016. I have provided data for a shorter time period for the example, so in this case, I would like to select the first 2 days of 2015 and the first 2 days of 2016.
I would like to end up with a dataframe like this (with date as the index).
date        price
2015-01-01     78
2015-01-02     87
2016-01-01     94
2016-01-02     55
I have been using this method to select dates, but I don't know how to select more than one range at a time:
select_dates2 = select_dates.loc['2015-01-01':'2015-01-02']
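One direct extension of that approach is to slice each range separately and concatenate the pieces; a minimal sketch, reusing select_dates from the question:
import pandas as pd

# slice each desired date range from the DatetimeIndex, then stack the results
ranges = [('2015-01-01', '2015-01-02'), ('2016-01-01', '2016-01-02')]
out = pd.concat([select_dates.loc[start:end] for start, end in ranges])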
Another way:
df['date'] = pd.to_datetime(df['date'])
df[df.date.dt.year.isin([2015, 2016]) & df.date.dt.day.lt(3)]
date price
0 2015-01-01 78
1 2015-01-02 87
3 2016-01-01 94
4 2016-01-02 55
One option is to use the dt accessor to select certain years and month-days, then use isin to create a boolean mask to filter df.
df['date'] = pd.to_datetime(df['date'])
out = df[df['date'].dt.year.isin([2015, 2016]) & df['date'].dt.strftime('%m-%d').isin(['01-01','01-02'])]
Output:
date price
0 2015-01-01 78
1 2015-01-02 87
3 2016-01-01 94
4 2016-01-02 55
One option is to move the date parts into a MultiIndex; this allows for a relatively easy selection on multiple levels (in this case, year and day):
(df
 .assign(year=df.date.dt.year, day=df.date.dt.day)
 .set_index(['year', 'day'])
 .loc(axis=0)[2015:2016, :2]
)
date price
year day
2015 1 2015-01-01 78
2 2015-01-02 87
2016 1 2016-01-01 94
2 2016-01-02 55

Resample dataframe based on time ranges, ignoring date

I am trying to resample my data to get sums. This resampling needs to be based solely on time. I want to group the times into 6-hour bins, so regardless of the date I will get 4 sums.
My df looks like this:
booking_count
date_time
2013-04-04 08:32:25           58
2013-04-04 18:43:11            1
2013-04-30 12:39:15           52
2013-05-14 06:51:33           99
2013-06-01 23:59:17            1
2013-06-03 19:37:25           42
2013-06-27 04:12:01           38
With this example data, I expect to get the following results:
00:00:00     38
06:00:00    157
12:00:00     52
18:00:00     44
To get around the date issue, I tried to keep only the time values:
df['time'] = pd.DatetimeIndex(df['date_time']).time
new_df = df[['time', 'booking_count']].set_index('time').resample('360min').sum()
Unfortunately, this was to no avail. How do I go about getting my required results? Is resample() even suitable for this task?
I don't think resample() is a good method for this, because you need to group by hour independently of the day. Instead you can try pd.cut with a custom bins parameter, followed by a usual groupby:
import numpy as np
import pandas as pd

# bin edges at hours 0, 6, 12, 18, 24
bins = np.arange(start=0, stop=24 + 6, step=6)
group = df.groupby(pd.cut(df.index.hour,
                          bins, right=False,
                          labels=pd.date_range('00:00:00', '18:00:00', freq='6H').time)).sum()
group
# booking_count
# 00:00:00 38
# 06:00:00 157
# 12:00:00 52
# 18:00:00 44
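Another option, sketched here under the assumption that date_time is the DatetimeIndex as in the question, is to floor each timestamp to a 6-hour boundary and keep only the time component, so the date drops out before grouping:
# floor('6H') maps e.g. 08:32 to 06:00 and 23:59 to 18:00 within each day;
# .time discards the date, so identical times on different days group together
result = df.groupby(df.index.floor('6H').time)['booking_count'].sum()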

How to calculate a mean of measurements taken at the same time (n-hours window) on different days in pandas dataframe?

I have a dataset with measurements acquired roughly every 2 hours over a week. I would like to calculate a mean of measurements taken at the same time on different days. For example, I want to calculate the mean of every measurement taken between 12:00 and 13:59.
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# generating a test dataframe
date_today = datetime.now()
time_of_taken_measurment = pd.date_range(date_today, date_today + timedelta(72),
                                         freq='2H20MIN')
np.random.seed(seed=1111)
data = np.random.randint(1, high=100, size=len(time_of_taken_measurment))
df = pd.DataFrame({'measurementTimestamp': time_of_taken_measurment,
                   'measurment': data})
df = df.set_index('measurementTimestamp')

# calculating the mean for measurements taken in the same hour
hourly_average = df.groupby([df.index.hour]).mean()
hourly_average
The code above gives me this output:
0 47.967742
1 43.354839
2 46.935484
.....
22 42.833333
23 52.741935
I would like to have a result like this:
0 mean0
2 mean1
4 mean2
.....
20 mean10
22 mean11
I was trying to solve my problem using the rolling_mean function, but I could not find a way to apply it to my static case.
Use the built-in floor functionality of DatetimeIndex, which allows you to easily create 2-hour time bins:
df.groupby(df.index.floor('2H').time).mean()
Output:
measurment
00:00:00 51.516129
02:00:00 54.868852
04:00:00 52.935484
06:00:00 43.177419
08:00:00 43.903226
10:00:00 55.048387
12:00:00 50.639344
14:00:00 48.870968
16:00:00 43.967742
18:00:00 49.225806
20:00:00 43.774194
22:00:00 50.590164
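If you only need one specific window, such as the 12:00-13:59 slot from the question, a minimal sketch (reusing the df with the DatetimeIndex built above) is DataFrame.between_time:
# keep only rows whose time of day falls within [12:00, 13:59], then average
noon_mean = df.between_time('12:00', '13:59')['measurment'].mean()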

How to make a histogram of pandas datetimes per specific time interval?

I want to plot some datetimes and would like to specify a time interval in order to bundle them together and make a histogram. So for example, if there happen to be n datetimes in the span of one hour, group them together or parse them as year, month, day, hour. And omit minutes and seconds.
Let's say I have a data frame with some datetime values:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
date_today = datetime.now()
days = pd.date_range(date_today, date_today + timedelta(7), freq='D')
np.random.seed(seed=1111)
data = np.random.randint(1, high=100, size=len(days))
df = pd.DataFrame({'test': days, 'col2': data})
df = df.set_index('test')
print(df)
                              col2
test
2018-06-19 17:10:32.076646      29
2018-06-20 17:10:32.076646      56
2018-06-21 17:10:32.076646      82
2018-06-22 17:10:32.076646      13
2018-06-23 17:10:32.076646      35
2018-06-24 17:10:32.076646      53
2018-06-25 17:10:32.076646      25
2018-06-26 17:10:32.076646      23
Ideally, I would like to specify a more flexible time interval, such as "6 hours" in order to make some sort of modulo operation on the datetimes. Is this possible?
pd.Grouper
This allows you to specify regular frequency intervals with which to group your data; use groupby to then aggregate your df based on these groups. For instance, if col2 held counts and you wanted to bin together all of the counts over 2-day intervals, you could do:
import pandas as pd
df.groupby(pd.Grouper(level=0, freq='2D')).col2.sum()
Outputs:
test
2018-06-19 13:49:11.560185 85
2018-06-21 13:49:11.560185 95
2018-06-23 13:49:11.560185 88
2018-06-25 13:49:11.560185 48
Name: col2, dtype: int32
You group by level=0, that is, your index labeled 'test', and sum col2 over 2-day bins. The behavior of pd.Grouper can be a little annoying, since in this example the bins start and end at 13:49:11..., which likely isn't what you want.
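As a side note, in newer pandas versions (1.1+) pd.Grouper accepts an origin argument that anchors the bins; a minimal sketch of how that would avoid the awkward 13:49:11 offsets:
# 'start_day' anchors the 2-day bins at midnight of the first day in the index
df.groupby(pd.Grouper(level=0, freq='2D', origin='start_day')).col2.sum()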
pd.cut + pd.date_range
You have a bit more control over defining your bins if you define them with pd.date_range and then use pd.cut. Here for instance, you can define bins every 2 days beginning on the 19th.
df.groupby(pd.cut(df.index,
                  pd.date_range('2018-06-19', '2018-06-27', freq='2D'))).col2.sum()
Outputs:
(2018-06-19, 2018-06-21] 85
(2018-06-21, 2018-06-23] 95
(2018-06-23, 2018-06-25] 88
(2018-06-25, 2018-06-27] 48
Name: col2, dtype: int32
This is nice, because if you instead wanted the bins to begin on even days, you could just change the start and end dates in pd.date_range:
df.groupby(pd.cut(df.index,
                  pd.date_range('2018-06-18', '2018-06-28', freq='2D'))).col2.sum()
Outputs:
(2018-06-18, 2018-06-20] 29
(2018-06-20, 2018-06-22] 138
(2018-06-22, 2018-06-24] 48
(2018-06-24, 2018-06-26] 78
(2018-06-26, 2018-06-28] 23
Name: col2, dtype: int32
If you really wanted to, you could specify 2.6-hour (2 h 36 min) bins beginning on June 19th 2018 at 5 AM:
df.groupby(pd.cut(df.index,
                  pd.date_range('2018-06-19 5:00:00', '2018-06-28 5:00:00', freq='2.6H'))).col2.sum()
#(2018-06-19 05:00:00, 2018-06-19 07:36:00] 0
#(2018-06-19 07:36:00, 2018-06-19 10:12:00] 0
#(2018-06-19 10:12:00, 2018-06-19 12:48:00] 0
#(2018-06-19 12:48:00, 2018-06-19 15:24:00] 29
#....
Histogram.
Just use .plot(kind='bar') after you have aggregated the data.
(df.groupby(pd.cut(df.index,
                   pd.date_range('2018-06-19', '2018-06-28', freq='2D')))
   .col2.sum().plot(kind='bar', color='firebrick', rot=30))
