Extract minutes per hour based on occupancy in Excel - python

Is there an easy way to extract the minutes per hour a room was used based on occupancy level? I would like to get an overview of how many minutes room 1 was used from 08:00:00-08:59:59, 09:00:00-09:59:59, etc.
I have done this manually by creating time intervals for every hour, e.g. starting at 08:00:00 and ending at 08:59:59. Then I used a SUMIF formula to get the number of minutes the room was occupied per hour for one day (9 hours in total per day).
As I want to see how many minutes per hour different rooms are occupied and compare them, I wonder if there is an easier way to do this? It would be great to have a format that I could use for all rooms. However, since all rooms will have different timestamps, this might be difficult?
If anyone knows how to do this in SQL or Python, that would be very helpful as well, especially in SQL!
The link below will give you an example of the data.

In Python, the data structure most analogous to a spreadsheet or a SQL table is the DataFrame from the pandas library.
First we can read in data from a spreadsheet like so:
import pandas as pd
df = pd.read_excel("<your filename>", parse_dates=["Timestamp"])
df["Time"] = df.Timestamp.dt.time
Here I am going to assume you have removed your work-in-progress (table on the right in the image) and that the data is in the first worksheet of the Excel file (otherwise we'll have to pass additional options).
I've ensured that the first (Timestamp) column is correctly understood as containing date-time data. By default it will assume 09.01.2020 ... refers to the 1st of September, American-style - I'm guessing that's what you want; additional options can be passed if you were really referring to the 9th of January (which is how I'd read that date).
I then overwrote the Time column with a time object extracted from the Timestamp. This isn't really necessary, but it gets the data as close to what was in the spreadsheet as possible. The DataFrame now looks like this:
             Timestamp Room name  Occupancy %      Time
0  2020-09-01 08:04:01    Room 1            0  08:04:01
1  2020-09-01 09:04:01    Room 1          100  09:04:01
2  2020-09-01 09:19:57    Room 1            0  09:19:57
3  2020-09-01 09:48:57    Room 1            0  09:48:57
4  2020-09-01 09:53:01    Room 1          100  09:53:01
5  2020-09-01 10:05:01    Room 1          100  10:05:01
6  2020-09-01 10:08:57    Room 1          100  10:08:57
7  2020-09-01 10:13:01    Room 1          100  10:13:01
(Note for next time, it would have been good to include something like this text in your question, it makes it much easier to construct an answer if the data doesn't have to be painstakingly put together)
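To make the date-format point concrete, here is a small sketch of how pandas reads a string like 09.01.2020 with and without the dayfirst option. It parses a literal string rather than the spreadsheet, so it only illustrates the parsing rule.

```python
import pandas as pd

# Default parsing treats the first number as the month (American style)
month_first = pd.to_datetime("09.01.2020 08:04:01")

# dayfirst=True treats the first number as the day instead
day_first = pd.to_datetime("09.01.2020 08:04:01", dayfirst=True)

print(month_first)  # 2020-09-01 08:04:01
print(day_first)    # 2020-01-09 08:04:01
```

If the column arrives as text, the same flag can be applied to the whole column with pd.to_datetime(df["Timestamp"], dayfirst=True).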
Now, there are a lot of things we can do with a DataFrame like this, but I'm going to try and get to where you want to go as directly as possible.
We'll start by using the Timestamp column as the 'index' and prepending a row for the time 08:00:00 because it's not currently part of your dataset, but you indicated you want it.
df2 = df.set_index("Timestamp")
df2.loc[pd.Timestamp("09.01.2020 08:00:00")] = ("Room 1", 0.0, None)
df2.sort_index(inplace=True)
The result looks like this:
                    Room name  Occupancy %      Time
Timestamp
2020-09-01 08:00:00    Room 1          0.0      None
2020-09-01 08:04:01    Room 1          0.0  08:04:01
2020-09-01 09:04:01    Room 1        100.0  09:04:01
2020-09-01 09:19:57    Room 1          0.0  09:19:57
2020-09-01 09:48:57    Room 1          0.0  09:48:57
2020-09-01 09:53:01    Room 1        100.0  09:53:01
2020-09-01 10:05:01    Room 1        100.0  10:05:01
2020-09-01 10:08:57    Room 1        100.0  10:08:57
2020-09-01 10:13:01    Room 1        100.0  10:13:01
Now, the simplest way to do this is to start by upsampling and forward-filling the data.
upsampled = df2.resample("1s").ffill()
upsampled is a huge DataFrame with a value for every second in the range. The forward-filling ensures your occupancy % is carried forward every second until one of your original datapoints said 'it changed here'. After the change, the new value is carried forward to the next datapoint etc.
This is done to ensure we get the necessary time resolution. Now we downsample. You were interested in each hour:
downsampled = upsampled[["Occupancy %"]].resample("1h").mean()
Only the numeric 'occupancy' column is selected, because a mean is not defined for the room names or the time objects. Here you'll get the following:
                     Occupancy %
Timestamp
2020-09-01 08:00:00     0.000000
2020-09-01 09:00:00    38.194444
2020-09-01 10:00:00   100.000000
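As a check on the numbers above, here is a self-contained sketch that rebuilds the Room 1 rows by hand instead of reading the Excel file, then upsamples to one-second resolution and averages per hour. The inline DataFrame is an assumption standing in for your spreadsheet.

```python
import pandas as pd

# Room 1 rows copied from the table above, including the prepended 08:00:00 row
stamps = pd.to_datetime(
    [
        "2020-09-01 08:00:00",
        "2020-09-01 08:04:01",
        "2020-09-01 09:04:01",
        "2020-09-01 09:19:57",
        "2020-09-01 09:48:57",
        "2020-09-01 09:53:01",
        "2020-09-01 10:05:01",
        "2020-09-01 10:08:57",
        "2020-09-01 10:13:01",
    ]
)
occupancy = pd.Series(
    [0.0, 0.0, 100.0, 0.0, 0.0, 100.0, 100.0, 100.0, 100.0],
    index=stamps,
    name="Occupancy %",
)

# One value per second, each carried forward until the next reading
per_second = occupancy.resample("1s").ffill()

# Mean occupancy per hour, then converted from percent to minutes
hourly = per_second.resample("1h").mean()
minutes = hourly * 60 / 100

print(hourly)
print(minutes)
```

In the 09:00 hour the room is occupied for 956 seconds (09:04:01 to 09:19:57) plus 419 seconds (09:53:01 to 10:00:00), and 1375 / 3600 is the 38.19 % shown above.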
But you indicated you might want to do this 'per room', so there might be other data with e.g. 'Room 2'. In that case, we have a categorical column, Room name, that we need to group by.
This is a bit harder, because it means we have to group before we upsample, to avoid ambiguity. This is going to create a MultiIndex. We have to collapse the 'group' level of the index, then group and downsample!
grouped = df2.groupby("Room name", as_index=False).resample('1s').ffill()
grouped.index = grouped.index.get_level_values(1)
result = grouped.groupby("Room name").resample("1h").mean()
which will look something like this:
                               Occupancy %
Room name Timestamp
Room 1    2020-09-01 08:00:00     0.000000
          2020-09-01 09:00:00    38.194444
          2020-09-01 10:00:00   100.000000
Room 2    2020-09-01 08:00:00     0.000000
          2020-09-01 09:00:00    38.194444
          2020-09-01 10:00:00   100.000000
(I just duplicated the data for Room 1 as Room 2, so the numbers are the same)
For a neat finish, we might unstack this multi-index, pivoting the room names into columns, then convert those percentages into whole minutes (rounded down).
Thus the whole solution is:
import pandas as pd
df = pd.read_excel("<your filename>", parse_dates=["Timestamp"])
df2 = df.set_index("Timestamp")
# prepend a dummy 08:00:00 row for every different room name; the rows are
# concatenated because assigning to one .loc label in a loop would overwrite it
rooms = df2["Room name"].unique()
start = pd.DatetimeIndex([pd.Timestamp("09.01.2020 08:00:00")] * len(rooms), name="Timestamp")
dummy = pd.DataFrame({"Room name": rooms, "Occupancy %": 0.0}, index=start)
df2 = pd.concat([dummy, df2]).sort_index()
grouped = df2.groupby("Room name", as_index=False).resample("1s").ffill()
grouped.index = grouped.index.droplevel(0)
result = (
    grouped
    .groupby("Room name")
    .resample("1h")
    .mean()
    .unstack(level=0)
    .div(100)     # % -> fraction
    .mul(60)      # fraction -> minutes
    .astype(int)  # truncate to whole minutes
)
# no longer 'Occupancy %', so drop the label
result.columns = result.columns.droplevel(0)
yielding a result like
Room name            Room 1  Room 2
Timestamp
2020-09-01 08:00:00       0       0
2020-09-01 09:00:00      22      22
2020-09-01 10:00:00      60      60
which hopefully is close to what you were after.
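If the MultiIndex bookkeeping feels heavy, a simpler variant is to handle each room separately and assemble the columns at the end. The sketch below is an alternative to the chained groupby above, not the same code. The inline data (Room 1 duplicated as Room 2) and the helper name minutes_per_hour are assumptions for illustration.

```python
import pandas as pd

stamps = pd.to_datetime(
    [
        "2020-09-01 08:00:00",
        "2020-09-01 08:04:01",
        "2020-09-01 09:04:01",
        "2020-09-01 09:19:57",
        "2020-09-01 09:48:57",
        "2020-09-01 09:53:01",
        "2020-09-01 10:05:01",
        "2020-09-01 10:08:57",
        "2020-09-01 10:13:01",
    ]
)
readings = [0.0, 0.0, 100.0, 0.0, 0.0, 100.0, 100.0, 100.0, 100.0]

# Stand-in for the spreadsheet: the Room 1 readings duplicated as Room 2
df = pd.concat(
    [
        pd.DataFrame({"Room name": room, "Occupancy %": readings}, index=stamps)
        for room in ("Room 1", "Room 2")
    ]
).sort_index()


def minutes_per_hour(occupancy_pct: pd.Series) -> pd.Series:
    """Occupied minutes per hour for one room's time-indexed readings."""
    per_second = occupancy_pct.resample("1s").ffill()
    return per_second.resample("1h").mean() * 60 / 100


# One column per room, one row per hour
result = pd.DataFrame(
    {
        room: minutes_per_hour(group["Occupancy %"])
        for room, group in df.groupby("Room name")
    }
)
print(result.round(1))
```

This keeps fractional minutes. Apply .round() or .astype(int) at the end if whole minutes are preferred.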

As a starting point:
SELECT
    room_name, sum(stop - start)
FROM
    room_table
WHERE
    timestamp BETWEEN 'some_time' AND 'another_time'
GROUP BY
    room_name
Here the SQL table is room_table, and the start and stop fields are assumed to be time types. 'some_time' and 'another_time' are just placeholders for the time range you are interested in.

Related

How to combine the data from two different dataframes with pyspark?

I have two different (and very large) dataframes (details below). And I need to merge the data from both of them. Since these dataframes are huge (with millions of rows in the first one and thousands in the second one), I was trying to use the AWS EMR service. But I don't quite understand how it is done there and the tutorials I've seen mostly show the instructions for one dataframe only. So, I've been wondering how I can use pyspark for two different dataframes.
Here are the details:
The first dataframe, say df, contains data about the people watching tv on different days. It looks like this:
id date other_data
0 0 2020-01-01 some data
1 1 2020-02-01 some data
2 2 2020-03-01 some data
3 3 2020-04-01 some data
4 4 2020-05-01 some data
Here, id is the id of the watcher, date is the date of watching, other_data contains other information (like the duration of watching, channel, etc.)
The second dataframe, say program, contains data about the programs. It looks like this:
date program start_time end_time
0 2020-01-01 program 1 14:00:00 15:00:00
1 2020-01-01 program 2 15:00:00 16:00:00
2 2020-01-01 program 3 16:00:00 17:00:00
3 2020-01-01 program 4 17:00:00 18:00:00
4 2020-01-01 program 5 18:00:00 19:00:00
Here, date is the date, program is the name of the program, start_time and end_time are the time of the program's beginning and end.
Basically, what I need to do is to create one dataframe that would contain all the information from both of these dataframes. I need this final dataframe to have a separate row for each user and each program. In other words, I need a dataframe that would duplicate each row in the first dataframe for each of the programs on the same day.
It might seem a little bit confusing, but here is an example of the final dataframe I want to receive:
id date other_data program start_time end_time
0 0 2020-01-01 some data program 1 14:00:00 15:00:00
1 0 2020-01-01 some data program 2 15:00:00 16:00:00
2 0 2020-01-01 some data program 3 16:00:00 17:00:00
3 0 2020-01-01 some data program 4 17:00:00 18:00:00
4 0 2020-01-01 some data program 5 18:00:00 19:00:00
As you can see, this final dataframe contains the data for each user and each program that was shown on the same day this user watched tv. In this particular case, the user with id=0 has watched tv on 01/01/2020. On the same day, program 1, program 2, program 3, program 4, and program 5 were shown. So, I need to have one row for each of the programs with their details. And, of course, I need the data from the first dataframe (contained in other_data).
So far, I created the following approach: I iterate over the first dataframe, for each row I find all the rows in the second dataframe that have the same date, merge it and add to the third (final) dataframe.
Here is the code I use:
ids = []  # users' id
dates = []  # dates
other_data = []  # other data from the first dataframe
programs = []  # all programs
start_times = []  # starting times
end_times = []  # ending times
for i, row in df.iterrows():
    temp = program.loc[program['date'] == row['date']]  # find all programs on the same date
    for j, program_row in temp.iterrows():  # iterate over the programs on the same date
        # append all the info
        ids.append(row['id'])
        dates.append(row['date'])
        other_data.append(row['other_data'])
        programs.append(program_row['program'])
        start_times.append(program_row['start_time'])
        end_times.append(program_row['end_time'])
# create final dataframe
final = pd.DataFrame({'id': ids, 'date': dates, 'other_data': other_data, 'program': programs,
                      'start_time': start_times, 'end_time': end_times})
This approach works, but it is extremely slow (considering the large size of the dataframes). I was, therefore, wondering how to split this job between several workers using EMR by AWS. If I understand it correctly, I need to split the first dataframe df between workers and, at the same time, provide them with the full program dataframe. Is it possible to do that? And how?
Would appreciate any help or advice!
It seems that both df and program are Pandas dataframes and merging/joining is the action needed, see pandas.DataFrame.merge. Try this:
import pandas as pd
final = pd.merge(df, program, on=['date'], how='inner')
In case the Pandas version is too slow, you could convert the dataframes to PySpark dataframes and perform the following steps:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.appName("convert").getOrCreate()
df_spark = spark.createDataFrame(df)
program_spark = spark.createDataFrame(program)
final_spark = df_spark.join(F.broadcast(program_spark), on=['date'], how='inner')
Here, it is assumed that the dataframe program is a small dataframe - if not, please remove the broadcast.
Hopefully this solves your issue and removes the slow loops.
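For a quick sanity check of the merge, here is a small sketch on hand-built versions of the two dataframes from the question. The values are copied from the tables shown there, and the dates are kept as plain strings for simplicity.

```python
import pandas as pd

# Viewers: one row per (id, date)
df = pd.DataFrame(
    {
        "id": [0, 1],
        "date": ["2020-01-01", "2020-02-01"],
        "other_data": ["some data", "some data"],
    }
)

# Programs: five programs, all shown on 2020-01-01
program = pd.DataFrame(
    {
        "date": ["2020-01-01"] * 5,
        "program": [f"program {i}" for i in range(1, 6)],
        "start_time": ["14:00:00", "15:00:00", "16:00:00", "17:00:00", "18:00:00"],
        "end_time": ["15:00:00", "16:00:00", "17:00:00", "18:00:00", "19:00:00"],
    }
)

# Inner join on date: every viewer row is repeated once per program on that date
final = df.merge(program, on="date", how="inner")
print(final)
```

The viewer with id=0 ends up with five rows, one per program, while id=1 drops out because no program is listed for 2020-02-01. Use how='left' to keep such rows.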

How to find total time between time intervals defined by start and end columns

I have a pandas DataFrame:
I want to calculate the difference between confirm and cancel in the following way:
For date 13.01.2020 and desk_id 1.0 : 10:35:00 – 8:00:00 + 12:36:00 – 11:36:00 + 20:00:00 - 13:36:00
I was able to perform these actions only for a desk with one hour of confirm and cancel. By one hour I mean that, for a given date and desk_id, I have only one row with a confirm and cancel time. I get the difference I'm interested in by subtracting 8:00:00 from confirm, subtracting cancel from 20:00:00, and adding the two together.
For many hours, I can't put it together. By many hours I mean that a desk_id has several rows with cancel and confirm times on one date. I would like to choose the date and desk_id and calculate the desk occupancy time - the difference between confirm and cancel for each desk.
Output should look like:
I would like to find periods of time when a desk is free.
In my data can be many confirms and cancels for desk in one date.
I did it for one hour confirm and cancel:
df_1['confirm'] = pd.to_timedelta(df_1['confirm'].astype(str))
df_1['diff_confirm'] = df_1['confirm'].apply(lambda x: x - datetime.timedelta(days=0, hours=8, minutes=0))
df_1['cancel'] = pd.to_timedelta(df_1['cancel'].astype(str))
df_1['diff_cancel'] = df_1['cancel'].apply(lambda x: datetime.timedelta(days=0, hours=20, minutes=0)-x)
and this works.
Any tips?
You did not make it entirely clear what format you need your results in, but I assume it is okay to put them in a separate dataframe. So this solution operates on each group of rows defined by values of date and desk_id and computes the total time for each group, with output placed in a new dataframe:
Code to create your input dataframe:
from datetime import timedelta
import pandas as pd
df = pd.DataFrame(
    {
        'date': [pd.Timestamp('2020-1-13'), pd.Timestamp('2020-1-13'),
                 pd.Timestamp('2020-1-13'), pd.Timestamp('2020-1-14'),
                 pd.Timestamp('2020-1-14'), pd.Timestamp('2020-1-14')],
        'desk_id': [1.0, 1.0, 2.0, 1.0, 2.0, 2.0],
        'confirm': ['10:36:00', '12:36:00', '09:36:00', '10:36:00', '12:36:00',
                    '15:36:00'],
        'cancel': ['11:36:00', '13:36:00', '11:36:00', '11:36:00', '14:36:00',
                   '16:36:00']
    }
)
Solution:
df['confirm'] = pd.to_timedelta(df['confirm'])
df['cancel'] = pd.to_timedelta(df['cancel'])
# function to compute total time each desk is free
def total_time(df):
    return (
        (df.iloc[0]['confirm'] - timedelta(days=0, hours=8, minutes=0)) +
        (df['confirm'] - df['cancel'].shift()).sum() +
        (timedelta(days=0, hours=20, minutes=0) - df.iloc[-1]['cancel'])
    )
# apply function to each combination of 'desk_id' and 'date', producing
# a new dataframe
df.groupby(['desk_id', 'date']).apply(total_time).reset_index(name='total_time')
# desk_id date total_time
# 0 1.0 2020-01-13 0 days 10:00:00
# 1 1.0 2020-01-14 0 days 11:00:00
# 2 2.0 2020-01-13 0 days 10:00:00
# 3 2.0 2020-01-14 0 days 09:00:00
The function takes the difference between the first value of confirm and 8:00:00, then the differences between each confirm and the preceding cancel value, and finally the difference between 20:00:00 and the last value of cancel. Those differences are added together to produce the final value.
One guess at what you're trying to do (I still can't fully understand, but here's an attempt):
import pandas as pd
from datetime import timedelta as td
# create the dataframe
a = pd.DataFrame({'data':['2020-01-13','2020-01-13','2020-01-14'],'desk_id':[1.0,1.0,1.0],'confirm':['10:36:00','12:36:00','13:14:00'],'cancel':['11:36:00','13:36:00','13:44:00']})
def get_avail_times(df, start_end_delta=td(hours=12)):
    df['confirm'] = pd.to_timedelta(df['confirm'])
    df['cancel'] = pd.to_timedelta(df['cancel'])
    # group by the two keys so that we can perform calculations on the specific groups
    df_g = df.groupby(['data', 'desk_id'], as_index=False).sum()
    df_g['total_time'] = start_end_delta - df_g['cancel'] + df_g['confirm']
    # keyword form: the positional axis argument to drop() is gone in recent pandas
    return df_g.drop(columns=['confirm', 'cancel'])
output = get_avail_times(a)
Which gives the output:
data desk_id total_time
0 2020-01-13 1.0 0 days 10:00:00
1 2020-01-14 1.0 0 days 11:30:00
The key here is to use .groupby() and then sum each group, which essentially evaluates the equation:
total_time = 20:00 + sum_confirm_times - sum_cancel_times - 08:00
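To see why the summed form agrees with the gap-by-gap calculation from the first answer, here is a small arithmetic check using plain datetime.timedelta values for desk 1.0 on 2020-01-13. The 08:00 to 20:00 working day is taken from the question.

```python
from datetime import timedelta

day_start = timedelta(hours=8)
day_end = timedelta(hours=20)
confirms = [timedelta(hours=10, minutes=36), timedelta(hours=12, minutes=36)]
cancels = [timedelta(hours=11, minutes=36), timedelta(hours=13, minutes=36)]

# Gap by gap: before the first confirm, between bookings, after the last cancel
gap_by_gap = (
    (confirms[0] - day_start)
    + (confirms[1] - cancels[0])
    + (day_end - cancels[1])
)

# Summed form: 20:00 + sum of confirms - sum of cancels - 08:00
summed = day_end + sum(confirms, timedelta()) - sum(cancels, timedelta()) - day_start

print(gap_by_gap, summed)  # 10:00:00 10:00:00
```

Both forms are the same sum with the terms reordered, which is why grouping and summing is enough.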

Pandas: Group by, Cumsum + Shift with a "where clause"

I am attempting to learn some Pandas that I otherwise would be doing in SQL window functions.
Assume I have the following dataframe which shows different players previous matches played and how many kills they got in each match.
date player kills
2019-01-01 a 15
2019-01-02 b 20
2019-01-03 a 10
2019-03-04 a 20
Throughout the below code I managed to create a groupby where I only show previous summed values of kills (the sum of the players kills excluding the kills he got in the game of the current row).
df['sum_kills'] = df.groupby('player')['kills'].transform(lambda x: x.cumsum().shift())
This creates the following values:
date player kills sum_kills
2019-01-01 a 15 NaN
2019-01-02 b 20 NaN
2019-01-03 a 10 15
2019-03-04 a 20 25
However what I ideally want is the option to include a filter/where clause in the grouped values. So let's say I only wanted to get the summed values from the previous 30 days (1 month). Then my new dataframe should instead look like this:
date player kills sum_kills
2019-01-01 a 15 NaN
2019-01-02 b 20 NaN
2019-01-03 a 10 15
2019-03-04 a 20 NaN
The last row would provide zero summed_kills because no games from player a had been played over the last month. Is this possible somehow?
I think you are in a bit of a pinch using groupby and transform. As explained here, transform operates on a single series, so you can't access the data of other columns.
groupby and apply does not seem the correct way either, because the custom function is expected to return an aggregated result for the group passed by groupby, but you want a different result for each row.
So the best solution I can propose is to use apply without groupby, and perform all the selection yourself inside the custom function:
def killcount(x, data, timewin):
    """count the player's kills in a time window before the time of current row.
    x: dataframe row
    data: full dataframe
    timewin: a pandas.Timedelta
    """
    return data.loc[(data['date'] < x['date'])  # select dates preceding current row
                    & (data['date'] >= x['date'] - timewin)  # select dates in the timewin
                    & (data['player'] == x['player'])]['kills'].sum()  # select rows with same player
df['sum_kills'] = df.apply(lambda r: killcount(r, df, pd.Timedelta(30, 'D')), axis=1)
This returns:
date player kills sum_kills
0 2019-01-01 a 15 0
1 2019-01-02 b 20 0
2 2019-01-03 a 10 15
3 2019-03-04 a 20 0
In case you haven't done so yet, remember to parse the 'date' column to a datetime type using pandas.to_datetime, otherwise you cannot perform date comparisons.
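A different route, closer to a SQL window function, is a time-based rolling window per player. This is a swapped-in technique rather than the apply approach above. The sketch assumes each (player, date) pair is unique and that the frame is sorted by date. With closed="left" the window covers the 30 days before the current row and excludes the row itself.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "date": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03", "2019-03-04"]),
        "player": ["a", "b", "a", "a"],
        "kills": [15, 20, 10, 20],
    }
).sort_values("date")

# Per player: sum of kills in the 30 days before each match, current match excluded
rolled = (
    df.set_index("date")
    .groupby("player")["kills"]
    .rolling("30D", closed="left")
    .sum()
    .reset_index(name="sum_kills")
)

# Attach the result to the original rows (assumes unique player/date pairs)
out = df.merge(rolled, on=["player", "date"], how="left")
print(out)
```

Rows with no earlier match in the window come out as NaN rather than 0, which matches the table in the question. Add .fillna(0) if zeros are preferred.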

pandas - efficiently computing minutely returns as columns on intraday data

I have a DataFrame that looks like such:
closingDate Time Last
0 1997-09-09 2018-12-13 00:00:00 1000
1 1997-09-09 2018-12-13 00:01:00 1002
2 1997-09-09 2018-12-13 00:02:00 1001
3 1997-09-09 2018-12-13 00:03:00 1005
I want to create a DataFrame with roughly 1440 columns labeled as timestamps, where the respective daily value is the return over the prior minute:
closingDate 00:00:00 00:01:00 00:02:00
0 1997-09-09 2018-12-13 -0.08 0.02 -0.001 ...
1 1997-09-10 2018-12-13 ...
My issue is that this is a very large DataFrame (several GB), and I need to do this operation multiple times. Time and memory efficiency is key, but time being more important. Is there some vectorized, built in method to do this in pandas?
You can do this with some aggregation and shifting your time series that should result in more efficient calculations.
First aggregate your data by closingDate.
g = df.groupby("closingDate")
Next you can shift your data within each group by one row, i.e. by one minute.
shifted = g.shift(periods=1)
This will create a new dataframe where the Last value will be from the previous minute. Now you can join to your original dataframe based on the index.
df = df.merge(shifted, left_index=True, right_index=True)
This adds the shifted columns to the new dataframe that you can use to do your difference calculation.
df["Diff"] = (df["Last_x"] - df["Last_y"]) / df["Last_y"]
You now have all the data you're looking for. If you need each minute to be its own column you can pivot the results. By grouping the closingDate and then applying the shift you avoid shifting dates across days. If you look at the first observation of each day you'll get a NaN since the values won't be shifted across separate days.
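Put together, the steps might look like the sketch below. It skips the merge by subtracting the shifted series directly, then pivots so that each minute becomes a column. The second day's prices and the helper column names ret and minute are made up for illustration.

```python
import pandas as pd

# Two days with three minutes each; the second day's prices are invented
df = pd.DataFrame(
    {
        "closingDate": ["1997-09-09"] * 3 + ["1997-09-10"] * 3,
        "Time": pd.to_datetime(
            ["2018-12-13 00:00:00", "2018-12-13 00:01:00", "2018-12-13 00:02:00"] * 2
        ),
        "Last": [1000.0, 1002.0, 1001.0, 2000.0, 2010.0, 2010.0],
    }
)

# Previous minute's price within the same day (NaN for each day's first row)
prev = df.groupby("closingDate")["Last"].shift(1)
df["ret"] = (df["Last"] - prev) / prev

# One row per day, one column per minute of the day
df["minute"] = df["Time"].dt.strftime("%H:%M:%S")
wide = df.pivot(index="closingDate", columns="minute", values="ret")
print(wide)
```

The first column is NaN for every day because the shift never crosses a day boundary.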

How to impute missing data within a date range in a table?

I have the following problem with imputing the missing or zero values in a table. It seems like it's more of an algorithm problem. I wanted to know if someone could help me out figure this out in python or R.
Asset Mileage Date
-----------------------------------
A 41,084 01/26/2017 00:00:00
A 0 01/24/2017 00:00:00
A 0 01/23/2017 00:00:00
A 40,864 01/19/2017 00:00:00
A 0 01/18/2017 00:00:00
B 5,000 01/13/2017 00:00:00
B 0 01/12/2017 00:00:00
B 0 01/11/2017 00:00:00
B 0 01/10/2017 00:00:00
B 0 01/09/2017 00:00:00
B 2,000 01/07/2017 00:00:00
For each asset (A, B, etc.), traverse the records chronologically (by date) and replace all the zeros with the average mileage increase between the points =
(later mileage that is not zero - earlier mileage that is not zero) /
(number of records from the earlier mileage to the later mileage) +
the earlier mileage.
for instance for the above table the data will look like this after it's fixed
Asset Mileage Date
-----------------------------------
A 41,084 01/26/2017 00:00:00
A 40,974 01/24/2017 00:00:00
A 40,919 01/23/2017 00:00:00
A 40,864 01/19/2017 00:00:00
A 39,800 01/18/2017 00:00:00
B 5,000 01/13/2017 00:00:00
B 4,000 01/12/2017 00:00:00
B 3,500 01/11/2017 00:00:00
B 3,000 01/10/2017 00:00:00
B 2,500 01/09/2017 00:00:00
B 2,000 01/07/2017 00:00:00
In the above case, for instance, the calculation for one of the records is as below:
(41084-40864)/4 (# of records from 40,864 to 41,084) = 55; 55 + previous
value (40,864) = 40919
It seems like you want to be using an analysis method that uses some sort of by to iterate over your data frame and find averages. You could consider something using by() and apply(). The specific iterative changes make it harder without adding in an ordered variable (i.e., right now your rows are implied to be numbered, but should be numbered by date within asset).
Steps to solving this yourself:
Create an ordered variable that provides a number from mileage (0) to mileage (X).
Use either by() or dplyr::group_by() to create averages within each asset. You might want to merge() or dplyr::inner_join() that to the original dataset, or use a lookup.
Use ifelse() to add that average to rows where mileage is 0, multiplying it by the ordered variable.
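Since the question also asks about Python, here is a rough pandas sketch of the same idea. Note that it uses plain linear interpolation between the surrounding non-zero readings, which is a swapped-in technique and does not reproduce the exact formula or numbers in the question. Only asset B from the table is included.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Asset": ["B"] * 6,
        "Mileage": [5000, 0, 0, 0, 0, 2000],
        "Date": pd.to_datetime(
            ["2017-01-13", "2017-01-12", "2017-01-11", "2017-01-10", "2017-01-09", "2017-01-07"]
        ),
    }
)

# Order chronologically within each asset
df = df.sort_values(["Asset", "Date"]).reset_index(drop=True)

# Treat zeros as missing, then interpolate linearly within each asset
known = df["Mileage"].mask(df["Mileage"] == 0)
df["Mileage"] = known.groupby(df["Asset"]).transform(lambda s: s.interpolate())
print(df)
```

The interpolation spaces the values evenly by row, not by the number of days between readings, and zeros that come before an asset's first non-zero reading stay missing.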
