How do you group data into time buckets and count the number of observations in each bucket, filling empty buckets with 0?
I have the following data set in a dataframe:
'''
df=
Time
0:10
5:00
5:00
5:02
5:03
5:05
5:07
5:09
6:00
6:00
6:00
'''
I would like to create 5-minute time buckets going from 00:00 to 23:59 and count how many observations fall into each bucket; if none, then 0. Basically, each time represents a unit in a queue, and I want to count how many units fall into a given time bucket.
From the above example data, I would like to get the following:
Time Obs
00:00 0
00:05 0
00:10 1
00:15 0
...
05:00 2
05:05 3
05:10 2
06:00 3
...
I tried the following code:
df['time_bucket'] = pd.to_datetime(df['Time']).dt.ceil('5min')
which did not work.
I tried the following as well:
df1 = df.resample('5T', on='time_bucket').count()
which results in:
Time time_bucket
time_bucket
2020-05-24 00:10:00 1 1
2020-05-24 00:15:00 0 0
2020-05-24 00:20:00 0 0
2020-05-24 00:25:00 0 0
2020-05-24 00:30:00 0 0
The time starts at 00:10 rather than at 00:00; it seems to start from the earliest value in the time_bucket column.
Basically, I want the count in a new column. Eventually, I would like to create a function that takes the bucket size as a parameter (e.g. 5, 10, or 15 minutes) and builds the table of counts for that bucket size.
I could not find a standard way to address your specific issue, namely time-only (no date) buckets, with native pandas functions.
First, instead of your dataset, which seems to be in string format, I generated random datetime.time values.
import random
import datetime
import pandas as pd
from collections import Counter
from datetime import time, timedelta
# generate 10k random minute-resolution times
x = [time(hour=random.randrange(0, 24), minute=random.randrange(0, 60)) for _ in range(10000)]
# use Counter to aggregate minute-wise data
y = Counter(x)
z = [{'time':item, 'freq':y.get(item)} for item in y]
df = pd.DataFrame(z)
# create bins
df['tbin']=df['time'].apply(lambda x: x.hour*12 + int(x.minute/5))
df['binStart']=df['time'].apply(lambda x: time(hour=x.hour, minute=(x.minute - x.minute%5)))
df['binEnd']=df['binStart'].apply(lambda a: (datetime.datetime.combine(datetime.datetime.now(), a)+timedelta(minutes=5)).time())
# grouping also orders the data
df_histogram=df.groupby(['tbin', 'binStart', 'binEnd'])['freq'].sum()
This is probably too late, but I was working on a similar but simpler problem and came across your unanswered question, which I thought would be more fun to solve than my own (which got solved in the process).
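For later readers, here is a rough pandas-only sketch of what the question asks for, using a hypothetical bucket_counts helper. It uses ceil (which reproduces the expected output shown in the question), value_counts for the counting, and reindex over a full-day grid to fill empty buckets with 0; the bucket width is a parameter, so 5, 10 or 15 minutes all work:
import pandas as pd

def bucket_counts(times, freq='5min'):
    # parse 'H:MM' strings (the date part is an arbitrary dummy) and ceil to the
    # bucket edge, matching the expected output in the question
    buckets = pd.to_datetime(times, format='%H:%M').dt.ceil(freq)
    counts = buckets.value_counts()
    # full grid of buckets for one day (times that ceil past midnight would be dropped here)
    full_day = pd.date_range(buckets.min().normalize(),
                             periods=pd.Timedelta('1D') // pd.Timedelta(freq),
                             freq=freq)
    out = counts.reindex(full_day, fill_value=0).rename('Obs')
    out.index = out.index.strftime('%H:%M')
    return out.rename_axis('Time').reset_index()

# the sample data from the question
df = pd.DataFrame({'Time': ['0:10', '5:00', '5:00', '5:02', '5:03', '5:05',
                            '5:07', '5:09', '6:00', '6:00', '6:00']})
print(bucket_counts(df['Time'], freq='5min'))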
Related
Sorry if this question has been asked before, but I can't seem to find one that describes my current issue.
Basically, I have a large climate dataset that is not bound to "real" dates. The dataset starts at "year one" and goes to "year 9999". The dates are stored as strings such as Jan-01, Feb-01, Mar-01, etc., where the number indicates the year. When trying to convert this column to datetime objects, I get an out-of-range error. (My reading suggests this is due to the 64-bit limit on the timestamps pandas can represent.)
What is a good way to work around this problem/process the date information so I can effectively plot the associated data vs these dates, over this ~10,000 year period?
Thanks
The cftime library was created specifically for this purpose, and xarray has a convenient xr.cftime_range function that makes creating such a range easy:
In [3]: import xarray as xr, pandas as pd
In [4]: date_range = xr.cftime_range('0001-01-01', '9999-01-01', freq='D')
In [5]: type(date_range)
Out[5]: xarray.coding.cftimeindex.CFTimeIndex
This creates a CFTimeIndex object which plays nicely with pandas:
In [8]: df = pd.DataFrame({"date": date_range, "vals": range(len(date_range))})
In [9]: df
Out[9]:
date vals
0 0001-01-01 00:00:00 0
1 0001-01-02 00:00:00 1
2 0001-01-03 00:00:00 2
3 0001-01-04 00:00:00 3
4 0001-01-05 00:00:00 4
... ... ...
3651692 9998-12-28 00:00:00 3651692
3651693 9998-12-29 00:00:00 3651693
3651694 9998-12-30 00:00:00 3651694
3651695 9998-12-31 00:00:00 3651695
3651696 9999-01-01 00:00:00 3651696
[3651697 rows x 2 columns]
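If the goal is plotting, note that matplotlib needs the optional nc-time-axis package to understand cftime dates. Assuming it is installed, a minimal sketch using the date_range and df from above might look like this:
import matplotlib.pyplot as plt
import xarray as xr

# wrap the values in a DataArray indexed by the CFTimeIndex; with nc-time-axis
# installed, matplotlib can render cftime dates on the x-axis
da = xr.DataArray(df['vals'].values, coords={'date': date_range}, dims='date')
da.plot()
plt.show()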
I have two different (and very large) dataframes (details below), and I need to merge the data from both of them. Since these dataframes are huge (millions of rows in the first one and thousands in the second), I was trying to use the AWS EMR service. But I don't quite understand how it is done there, and the tutorials I've seen mostly show instructions for a single dataframe. So I've been wondering how I can use pyspark with two different dataframes.
Here are the details:
The first dataframe, say df, contains data about the people watching tv on different days. It looks like this:
id date other_data
0 0 2020-01-01 some data
1 1 2020-02-01 some data
2 2 2020-03-01 some data
3 3 2020-04-01 some data
4 4 2020-05-01 some data
Here, id is the id of the watcher, date is the date of watching, other_data contains other information (like the duration of watching, channel, etc.)
The second dataframe, say program, contains data about the programs. It looks like this:
date program start_time end_time
0 2020-01-01 program 1 14:00:00 15:00:00
1 2020-01-01 program 2 15:00:00 16:00:00
2 2020-01-01 program 3 16:00:00 17:00:00
3 2020-01-01 program 4 17:00:00 18:00:00
4 2020-01-01 program 5 18:00:00 19:00:00
Here, date is the date, program is the name of the program, start_time and end_time are the time of the program's beginning and end.
Basically, what I need to do is to create one dataframe that would contain all the information from both of these dataframes. I need this final dataframe to have a separate row for each user and each program. In other words, I need a dataframe that would duplicate each row in the first dataframe for each of the programs on the same day.
It might seem a little bit confusing, but here is an example of the final dataframe I want to receive:
id date other_data program start_time end_time
0 0 2020-01-01 some data program 1 14:00:00 15:00:00
1 0 2020-01-01 some data program 2 15:00:00 16:00:00
2 0 2020-01-01 some data program 3 16:00:00 17:00:00
3 0 2020-01-01 some data program 4 17:00:00 18:00:00
4 0 2020-01-01 some data program 5 18:00:00 19:00:00
As you can see, this final dataframe contains the data for each user and each program that was shown on the same day this user watched tv. In this particular case, the user with id=0 has watched tv on 01/01/2020. On the same day, program 1, program 2, program 3, program 4, and program 5 were shown. So, I need to have one row for each of the programs with their details. And, of course, I need the data from the first dataframe (contained in other_data).
So far, my approach is the following: I iterate over the first dataframe, and for each row I find all the rows in the second dataframe that have the same date, merge them, and add the result to a third (final) dataframe.
Here is the code I use:
ids = [] # users' id
dates = [] # dates
other_data = [] # other data from the first dataframe
programs = [] # all programs
start_times = [] # starting times
end_times = [] # ending times
for i, row in df.iterrows():
    temp = program.loc[program['date'] == row['date']]  # find all programs on the same date
    for j, program_row in temp.iterrows():  # iterate over the programs on the same date
        # append all the info
        ids.append(row['id'])
        dates.append(row['date'])
        other_data.append(row['other_data'])
        programs.append(program_row['program'])
        start_times.append(program_row['start_time'])
        end_times.append(program_row['end_time'])

# create final dataframe
final = pd.DataFrame({'id': ids, 'date': dates, 'other_data': other_data, 'program': programs,
                      'start_time': start_times, 'end_time': end_times})
This approach works, but it is extremely slow given the size of the dataframes. I was therefore wondering how to split this job between several workers using EMR on AWS. If I understand it correctly, I need to split the first dataframe df between workers and, at the same time, provide each of them with the full program dataframe. Is it possible to do that? And how?
Would appreciate any help or advice!
It seems that both df and program are Pandas dataframes and merging/joining is the action needed, see pandas.DataFrame.merge. Try this:
import pandas as pd
final = pd.merge(df, program, on=['date'], how='inner')
In case the pandas version is too slow, you could convert the dataframes to PySpark dataframes and perform the following steps:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("convert").getOrCreate()
df_spark = spark.createDataFrame(df)
program_spark = spark.createDataFrame(program)
final_spark = df_spark.join(F.broadcast(program_spark), on=['date'], how='inner')
Here, it is assumed that program is a small dataframe; if not, remove the broadcast.
Hopefully this solves your issue and gets rid of the slow loops.
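To sanity-check the joined result, you can look at a few rows with show() and only pull it back into pandas if it is small enough to fit in driver memory:
# inspect a few joined rows without collecting everything to the driver
final_spark.show(5)
# only if the joined result fits comfortably in driver memory
final_pdf = final_spark.toPandas()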
I have a pandas DataFrame:
I want to calculate the difference between confirm and cancel in the following way:
For date 13.01.2020 and desk_id 1.0 : 10:35:00 – 8:00:00 + 12:36:00 – 11:36:00 + 20:00:00 - 13:36:00
I was able to perform these calculations only for a desk with a single confirm/cancel pair. By that I mean that for a given date and desk_id there is only one row with a confirm and a cancel time. The difference I'm interested in is what I get when I subtract 8:00:00 from confirm, subtract cancel from 20:00:00, and add the two together.
For multiple rows I can't put it together. By multiple rows I mean that a desk_id on one date has several rows with cancel and confirm times. I would like to pick a date and desk_id and calculate the desk occupancy time, i.e. the difference between confirm and cancel for each desk.
Output should look like:
I would like to find the periods of time when a desk is free.
In my data there can be many confirms and cancels for a desk on one date.
I did it for a single confirm/cancel pair:
df_1['confirm'] = pd.to_timedelta(df_1['confirm'].astype(str))
df_1['diff_confirm'] = df_1['confirm'].apply(lambda x: x - datetime.timedelta(days=0, hours=8, minutes=0))
df_1['cancel'] = pd.to_timedelta(df_1['cancel'].astype(str))
df_1['diff_cancel'] = df_1['cancel'].apply(lambda x: datetime.timedelta(days=0, hours=20, minutes=0)-x)
and this works.
Any tips?
You did not make it entirely clear what format you need your results in, but I assume it is okay to put them in a separate dataframe. So this solution operates on each group of rows defined by values of date and desk_id and computes the total time for each group, with output placed in a new dataframe:
Code to create your input dataframe:
from datetime import timedelta
import pandas as pd
df = pd.DataFrame(
    {
        'date': [pd.Timestamp('2020-1-13'), pd.Timestamp('2020-1-13'),
                 pd.Timestamp('2020-1-13'), pd.Timestamp('2020-1-14'),
                 pd.Timestamp('2020-1-14'), pd.Timestamp('2020-1-14')],
        'desk_id': [1.0, 1.0, 2.0, 1.0, 2.0, 2.0],
        'confirm': ['10:36:00', '12:36:00', '09:36:00', '10:36:00', '12:36:00',
                    '15:36:00'],
        'cancel': ['11:36:00', '13:36:00', '11:36:00', '11:36:00', '14:36:00',
                   '16:36:00']
    }
)
Solution:
df['confirm'] = pd.to_timedelta(df['confirm'])
df['cancel'] = pd.to_timedelta(df['cancel'])
# function to compute total time each desk is free
def total_time(df):
    return (
        (df.iloc[0]['confirm'] - timedelta(days=0, hours=8, minutes=0)) +
        (df['confirm'] - df['cancel'].shift()).sum() +
        (timedelta(days=0, hours=20, minutes=0) - df.iloc[-1]['cancel'])
    )
# apply function to each combination of 'desk_id' and 'date', producing
# a new dataframe
df.groupby(['desk_id', 'date']).apply(total_time).reset_index(name='total_time')
# desk_id date total_time
# 0 1.0 2020-01-13 0 days 10:00:00
# 1 1.0 2020-01-14 0 days 11:00:00
# 2 2.0 2020-01-13 0 days 10:00:00
# 3 2.0 2020-01-14 0 days 09:00:00
The function takes the difference between the first value of confirm and 8:00:00, the differences between each confirm and the preceding cancel, and then the difference between 20:00:00 and the last value of cancel. Those differences are added together to produce the final value.
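As a quick check against the example data: for desk 1.0 on 2020-01-13 the gaps are (10:36 - 08:00) + (12:36 - 11:36) + (20:00 - 13:36) = 2:36 + 1:00 + 6:24 = 10:00, which matches the first row of the output above.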
One guess at what you're trying to do (I still can't fully understand, but here's an attempt):
import pandas as pd
from datetime import timedelta as td
# create the dataframe
a = pd.DataFrame({'data': ['2020-01-13', '2020-01-13', '2020-01-14'],
                  'desk_id': [1.0, 1.0, 1.0],
                  'confirm': ['10:36:00', '12:36:00', '13:14:00'],
                  'cancel': ['11:36:00', '13:36:00', '13:44:00']})
def get_avail_times(df, start_end_delta=td(hours=12)):
    df['confirm'] = pd.to_timedelta(df['confirm'])
    df['cancel'] = pd.to_timedelta(df['cancel'])
    # group by the two keys so that we can perform calculations on each specific group
    df_g = df.groupby(['data', 'desk_id'], as_index=False).sum()
    df_g['total_time'] = start_end_delta - df_g['cancel'] + df_g['confirm']
    return df_g.drop(columns=['confirm', 'cancel'])
output = get_avail_times(a)
Which gives the output:
data desk_id total_time
0 2020-01-13 1.0 0 days 10:00:00
1 2020-01-14 1.0 0 days 11:30:00
The key here is to use .groupby() and then sum each group, which essentially performs the equation:
total_time = 20:00 + sum_confirm_times - sum_cancel_times - 08:00
I am attempting to learn how to do in pandas what I would otherwise do with SQL window functions.
Assume I have the following dataframe, which shows different players' previous matches and how many kills they got in each match.
date player kills
2019-01-01 a 15
2019-01-02 b 20
2019-01-03 a 10
2019-03-04 a 20
With the code below I managed to create a groupby that only shows the previously summed kills (the sum of the player's kills excluding the kills from the current row's game).
df['sum_kills'] = df.groupby('player')['kills'].transform(lambda x: x.cumsum().shift())
This creates the following values:
date player kills sum_kills
2019-01-01 a 15 NaN
2019-01-02 b 20 NaN
2019-01-03 a 10 15
2019-03-04 a 20 25
However what I ideally want is the option to include a filter/where clause in the grouped values. So let's say I only wanted to get the summed values from the previous 30 days (1 month). Then my new dataframe should instead look like this:
date player kills sum_kills
2019-01-01 a 15 NaN
2019-01-02 b 20 NaN
2019-01-03 a 10 15
2019-03-04 a 20 NaN
The last row would provide zero summed_kills because no games from player a had been played over the last month. Is this possible somehow?
I think you are a bit in a pinch using groupby and transform. As explained here, transform operates on a single series, so you can't access data of other columns.
groupby and apply does not seem to be the correct way either, because the custom function is expected to return an aggregated result for the group passed by groupby, but you want a different result for each row.
So the best solution I can propose is to use apply without groupby, and perform all the selection yourself inside the custom function:
def killcount(x, data, timewin):
    """count the player's kills in a time window before the time of the current row.
    x: dataframe row
    data: full dataframe
    timewin: a pandas.Timedelta
    """
    return data.loc[(data['date'] < x['date'])               # select dates preceding the current row
                    & (data['date'] >= x['date'] - timewin)  # select dates within the time window
                    & (data['player'] == x['player'])]['kills'].sum()  # select rows with the same player
df['sum_kills'] = df.apply(lambda r : killcount(r, df, pd.Timedelta(30, 'D')), axis=1)
This returns:
date player kills sum_kills
0 2019-01-01 a 15 0
1 2019-01-02 b 20 0
2 2019-01-03 a 10 15
3 2019-03-04 a 20 0
In case you haven't done it yet, remember to parse the 'date' column to datetime type using pandas.to_datetime; otherwise you cannot perform date comparisons.
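Alternatively, if your pandas version supports the closed argument for time-based rolling windows, a per-player rolling sum can express the same idea without a row-wise apply. A rough sketch (it reconstructs the question's data, assumes the frame is sorted by player and date so the grouped result aligns back positionally, and leaves empty windows as NaN rather than 0):
import pandas as pd

# the question's example data
df = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-01', '2019-01-02', '2019-01-03', '2019-03-04']),
    'player': ['a', 'b', 'a', 'a'],
    'kills': [15, 20, 10, 20],
})

# sort so the per-player rolling result lines up positionally when assigned back
df = df.sort_values(['player', 'date']).reset_index(drop=True)

# 30-day window per player; closed='left' excludes the current row, so only
# kills from strictly earlier matches are summed
df['sum_kills'] = (
    df.set_index('date')
      .groupby('player')['kills']
      .rolling('30D', closed='left')
      .sum()
      .values
)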
I am working on a large dataset that looks like this:
Time, Value
01.01.2018 00:00:00.000, 5.1398
01.01.2018 00:01:00.000, 5.1298
01.01.2018 00:02:00.000, 5.1438
01.01.2018 00:03:00.000, 5.1228
01.01.2018 00:04:00.000, 5.1168
.... , ,,,,
31.12.2018 23:59:59.000, 6.3498
The data is minute-level data from the first day of the year to the last day of the year.
I want to use Pandas to find the average of every 5 days.
For example:
Average from 01.01.2018 00:00:00.000 to 05.01.2018 23:59:59.000 is the average for 05.01.2018.
The next average, from 02.01.2018 00:00:00.000 to 06.01.2018 23:59:59.000, is the average for 06.01.2018.
The next average, from 03.01.2018 00:00:00.000 to 07.01.2018 23:59:59.000, is the average for 07.01.2018.
And so on... The day increments by 1, but each average covers the past 5 days, including the current date.
For a given day there are 24 hours * 60 minutes = 1440 data points, so I need the average of 1440 data points * 5 days = 7200 data points.
The final DataFrame will look like this, with the time in [DD.MM.YYYY] format (without hh:mm:ss) and the Value being the 5-day average including the current date:
Time, Value
05.01.2018, 5.1398
06.01.2018, 5.1298
07.01.2018, 5.1438
.... , ,,,,
31.12.2018, 6.3498
The bottom line is to calculate the average of the data over the past 5 days including the current day, with the result shown as above.
I tried iterating with a plain Python loop, but I would like something better, using pandas.
Perhaps this will work?
import numpy as np
import pandas as pd
# Create one year of random data spaced evenly in 1 minute intervals.
np.random.seed(0) # So that others can reproduce the same result given the random numbers.
time_idx = pd.date_range(start='2018-01-01', end='2018-12-31', freq='min')
df = pd.DataFrame({'Time': time_idx, 'Value': abs(np.random.randn(len(time_idx))) + 5})
>>> df.shape
(524161, 2)
Given the dataframe with 1 minute intervals, you can take a rolling average over the past five days (5 days * 24 hours/day * 60 minutes/hour = 7200 minutes) and assign the result to a new column named rolling_5d_avg. You can then group on the original timestamps using the dt accessor method to grab the date, and then take the last rolling_5d_avg value for each date.
df = (
    df
    .assign(rolling_5d_avg=df.rolling(window=5*24*60)['Value'].mean())
    .groupby(df['Time'].dt.date)['rolling_5d_avg']
    .last()
)
>>> df.head(10)
Time
2018-01-01 NaN
2018-01-02 NaN
2018-01-03 NaN
2018-01-04 NaN
2018-01-05 5.786603
2018-01-06 5.784011
2018-01-07 5.790133
2018-01-08 5.786967
2018-01-09 5.789944
2018-01-10 5.789299
Name: rolling_5d_avg, dtype: float64
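If you then want the exact layout from the question (one row per date, Time formatted as DD.MM.YYYY, with the first four incomplete days dropped), one possible follow-up is:
# drop the incomplete leading days, turn the date index back into a column,
# and format it as DD.MM.YYYY
result = df.dropna().reset_index()
result.columns = ['Time', 'Value']
result['Time'] = pd.to_datetime(result['Time']).dt.strftime('%d.%m.%Y')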