How to impute missing data within a date range in a table? - python

I have the following problem with imputing missing or zero values in a table. It seems like more of an algorithm problem. I wanted to know if someone could help me figure this out in Python or R.
Asset Mileage Date
-----------------------------------
A 41,084 01/26/2017 00:00:00
A 0 01/24/2017 00:00:00
A 0 01/23/2017 00:00:00
A 40,864 01/19/2017 00:00:00
A 0 01/18/2017 00:00:00
B 5,000 01/13/2017 00:00:00
B 0 01/12/2017 00:00:00
B 0 01/11/2017 00:00:00
B 0 01/10/2017 00:00:00
B 0 01/09/2017 00:00:00
B 2,000 01/07/2017 00:00:00
For each asset (A, B, etc.), traverse the records chronologically (by date) and replace all the zeros with the average of the mileage between the surrounding non-zero points:
(earlier mileage that is not zero - later mileage that is not zero)
/ (number of records from the earlier mileage to the later mileage)
+ the earlier mileage.
For instance, for the above table the data will look like this after it is fixed:
Asset Mileage Date
-----------------------------------
A 41,084 01/26/2017 00:00:00
A 40,974 01/24/2017 00:00:00
A 40,919 01/23/2017 00:00:00
A 40,864 01/19/2017 00:00:00
A 39,800 01/18/2017 00:00:00
B 5,000 01/13/2017 00:00:00
B 4,000 01/12/2017 00:00:00
B 3,500 01/11/2017 00:00:00
B 3,000 01/10/2017 00:00:00
B 2,500 01/09/2017 00:00:00
B 2,000 01/07/2017 00:00:00
In the above case, for instance, the calculation for one of the records is as follows:
(41084 - 40864) / 4 (# of records from 40,864 to 41,084) = 55; 55 + the previous value (40,864) = 40,919

It seems like you want an analysis method that uses some sort of "by" grouping to iterate over your data frame and compute the averages. You could consider something using by() and apply(). The specific iterative changes are harder without adding an ordering variable (i.e., right now your rows are implicitly numbered, but they should be numbered by date within each asset).
Steps to solving this yourself:
1. Create an ordering variable that numbers the records from one non-zero mileage to the next.
2. Use either by() or dplyr::group_by() to compute the per-step average within each asset. You might want to merge() or dplyr::inner_join() that back onto the original dataset, or use a lookup.
3. Use ifelse() to add that average, multiplied by the ordering variable, to rows where mileage is 0.
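Since the question also allows Python, here is a minimal pandas sketch of the same idea (my own construction, not code from the question): zeros are treated as missing and then filled by linear interpolation between the surrounding non-zero readings within each asset. Two caveats: interpolate() splits each gap evenly between the two known values (dividing by the number of gaps rather than the number of records, so the steps differ slightly from the rule above), and zeros before the first known reading (for example asset A on 01/18) are left as NaN and need their own extrapolation rule.
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "Asset":   ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"],
    "Mileage": [41084, 0, 0, 40864, 0, 5000, 0, 0, 0, 0, 2000],
    "Date":    pd.to_datetime([
        "2017-01-26", "2017-01-24", "2017-01-23", "2017-01-19", "2017-01-18",
        "2017-01-13", "2017-01-12", "2017-01-11", "2017-01-10", "2017-01-09",
        "2017-01-07",
    ]),
})

# Treat zeros as missing, sort chronologically within each asset, and fill each
# gap by linear interpolation between the surrounding non-zero readings.
df["Mileage"] = df["Mileage"].replace(0, np.nan)
df = df.sort_values(["Asset", "Date"])
df["Mileage"] = df.groupby("Asset")["Mileage"].transform(lambda s: s.interpolate())
print(df)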

Related

Pandas dataframe vectorized bucketing/aggregation?

The Task
I have a dataframe that looks like this:
date                  money_spent ($)  meals_eaten  weight
2021-01-01 10:00:00   350              5            140
2021-01-02 18:00:00   250              2            170
2021-01-03 12:10:00   200              3            160
2021-01-04 19:40:00   100              1            150
I want to discretize this so that it "cuts" the rows every $X. I want to know some statistics on how much is being done for every $X I spend.
So if I were to use $500 as a threshold, the first two rows would fall in the first cut, and I could aggregate the remaining columns as follows:
first date of the cut
average meals_eaten
minimum weight
maximum weight
So the final table would be two rows like this:
date                  cumulative_spent ($)  meals_eaten  min_weight  max_weight
2021-01-01 10:00:00   600                   3.5          140         170
2021-01-03 12:10:00   300                   2            150         160
My Approach:
My first instinct is to calculate the cumsum() of money_spent (assume the data is sorted by date), then use pd.cut() to make a new column, call it spent_bin, that determines each row's bin.
Note: In this toy example, spent_bin would basically be [0, 500] for the first two rows and (500, 1000] for the last two.
Then it's fairly simple, I do a groupby spent_bin then aggregate as follows:
.agg({
    'date': 'first',
    'meals_eaten': 'mean',
    'weight': ['min', 'max'],
})
What I've Tried
import pandas as pd

rows = [
    {"date": "2021-01-01 10:00:00", "money_spent": 350, "meals_eaten": 5, "weight": 140},
    {"date": "2021-01-02 18:00:00", "money_spent": 250, "meals_eaten": 2, "weight": 170},
    {"date": "2021-01-03 12:10:00", "money_spent": 200, "meals_eaten": 3, "weight": 160},
    {"date": "2021-01-05 22:07:00", "money_spent": 100, "meals_eaten": 1, "weight": 150},
]
df = pd.DataFrame.from_dict(rows)
df['date'] = pd.to_datetime(df.date)
df['cum_spent'] = df.money_spent.cumsum()
print(df)
print(pd.cut(df.cum_spent, 500))
For some reason, I can't get the cut step to work; my toy code is above. The labels are not cleanly [0, 500], (500, 1000]. Honestly I'd settle for [350, 500], (500, 800] (these are the actual cumulative sums at the edges of the cuts), but I can't even get that to work, even though I'm doing exactly the same thing as the documentation example. Any help with this?
Caveats and Difficulties:
It's pretty easy to write this as a loop of course, just do a while cum_spent < 500: loop. The problem is that I have millions of rows in my actual dataset, and it currently takes me 20 minutes to process a single df this way.
There's also a minor issue that sometimes a row will break the interval. When that happens, I want that last row included. This occurs in the toy example, where row #2 actually ends at $600, not $500; but it is the first row that ends at or surpasses $500, so I include it in the first bin.
The following custom function achieves the cumulative sum with a reset once the limit is reached. First define the helper, compiled with numba:
from numba import njit

@njit
def cumli(x, lim):
    # flag the row on which the running total reaches lim, then reset the total
    total = 0
    result = []
    for y in x:
        check = 0
        total += y
        if total >= lim:
            total = 0
            check = 1
        result.append(check)
    return result
Then build the group labels and aggregate. Reversing the flag column before taking its cumulative sum ensures that the row which crosses the threshold is grouped with the rows before it rather than starting a new group:
df['new'] = cumli(df['money_spent'].values, 500)
out = (df.groupby(df.new.iloc[::-1].cumsum())
         .agg(date=('date', 'first'),
              meals_eaten=('meals_eaten', 'mean'),
              min_weight=('weight', 'min'),
              max_weight=('weight', 'max'))
         .sort_index(ascending=False))
Out[81]:
          date  meals_eaten  min_weight  max_weight
new
1   2021-01-01          3.5         140         170
0   2021-01-03          2.0         150         160
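As a side note on the pd.cut attempt in the question: passing an integer as the second argument asks for that many equal-width bins (500 of them here), which is why the labels do not come out as a clean [0, 500], (500, 1000]. To get specific edges you pass them explicitly, for example (reusing the toy df from the question):
print(pd.cut(df.cum_spent, bins=[0, 500, 1000]))
This only fixes the labelling; it does not by itself implement the "include the row that crosses the threshold" rule, which is what the cumsum-with-reset helper above handles.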

Extract minutes per hour based on occupancy in Excel

Is there an easy way to extract the minutes per hour a room was used, based on occupancy level? I would like to get an overview of how many minutes room 1 was used from 08:00:00-08:59:59, 09:00:00-09:59:59, etc.
I have done this manually by creating time intervals for every hour, starting at for example 08:00:00 and ending at 08:59:59. Then I have used a SUMIF formula to get the number of minutes the room was occupied per hour for one day (9 hours in total per day).
As I want to see how many minutes per hour different rooms are occupied and compare them, I wonder if there is an easier way to do this. It would be great to have a format that I could use for all rooms. However, since all rooms will have different timestamps, this might be difficult?
If anyone knows how to do this in SQL or Python, that would be very helpful as well, especially in SQL!
The link below will give you an example of the data.
In python, the most analogous data structure to a spreadsheet or a SQL table is the DataFrame from the pandas library.
First we can read in data from a spreadsheet like so:
import pandas as pd
df = pd.read_excel("<your filename>", parse_dates=[1])
df["Time"] = df.Timestamp.dt.time
Here I am going to assume you have removed your work-in-progress (table on the right in the image) and that the data is in the first worksheet of the Excel file (otherwise we'll have to pass additional options).
I've ensured that the first (Timestamp) column is correctly understood as containing date-time data. By default it will assume 09.01.2020 ... refers to the 1st of September, American-style - I'm guessing that's what you want; additional options can be passed if you were really referring to the 9th of January (which is how I'd read that date).
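If the dates really were day-first, one option, assuming the Timestamp column arrived as text rather than as true Excel datetimes, would be to re-parse it explicitly:
df["Timestamp"] = pd.to_datetime(df["Timestamp"], dayfirst=True)
A column that was already parsed as datetimes would need to be corrected at the source instead.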
I then overwrote the Time column with a time object extracted from the Timestamp, this isn't really necessary but gets the data as close to what was in the spreadsheet as possible. The DataFrame now looks like this:
Timestamp Room name Occupancy % Time
0 2020-09-01 08:04:01 Room 1 0 08:04:01
1 2020-09-01 09:04:01 Room 1 100 09:04:01
2 2020-09-01 09:19:57 Room 1 0 09:19:57
3 2020-09-01 09:48:57 Room 1 0 09:48:57
4 2020-09-01 09:53:01 Room 1 100 09:53:01
5 2020-09-01 10:05:01 Room 1 100 10:05:01
6 2020-09-01 10:08:57 Room 1 100 10:08:57
7 2020-09-01 10:13:01 Room 1 100 10:13:01
(Note for next time: it would have been good to include something like this text in your question; it makes it much easier to construct an answer if the data doesn't have to be painstakingly put together.)
Now, there are a lot of things we can do with a DataFrame like this, but I'm going to try and get to where you want to go as directly as possible.
We'll start by using the Timestamp column as the 'index' and prepending a row for the time 08:00:00 because it's not currently part of your dataset, but you indicated you want it.
df2 = df.set_index("Timestamp")
df2.loc[pd.Timestamp("09.01.2020 08:00:00")] = ("Room 1", 0.0, None)
df2.sort_index(inplace=True)
The result looks like this:
Room name Occupancy % Time
Timestamp
2020-09-01 08:00:00 Room 1 0.0 None
2020-09-01 08:04:01 Room 1 0.0 08:04:01
2020-09-01 09:04:01 Room 1 100.0 09:04:01
2020-09-01 09:19:57 Room 1 0.0 09:19:57
2020-09-01 09:48:57 Room 1 0.0 09:48:57
2020-09-01 09:53:01 Room 1 100.0 09:53:01
2020-09-01 10:05:01 Room 1 100.0 10:05:01
2020-09-01 10:08:57 Room 1 100.0 10:08:57
2020-09-01 10:13:01 Room 1 100.0 10:13:01
Now, the simplest way to do this is to start by upsampling and forward-filling the data.
upsampled = df2.resample("1s").ffill()
upsampled is a huge DataFrame with a value for every second in the range. The forward-filling ensures your occupancy % is carried forward every second until one of your original datapoints said 'it changed here'. After the change, the new value is carried forward to the next datapoint etc.
This is done to ensure we get the necessary time resolution. Normally I would now downsample. You were interested in each hour:
downsampled = upsampled.resample("1h").mean()
By taking the mean, we'll get only the numeric columns in our output, i.e. 'occupancy', and here you'll get the following:
Occupancy %
Timestamp
2020-09-01 08:00:00 0.000000
2020-09-01 09:00:00 38.194444
2020-09-01 10:00:00 100.000000
But you indicated you might want to do this 'per room', so there might be other data with e.g. 'Room 2'. In that case, we have a categorical column, Room name, that we need to group by.
This is a bit harder, because it means we have to group before we upsample, to avoid ambiguity. This is going to create a MultiIndex. We have to collapse the 'group' level of the index, then group and downsample!
grouped = df2.groupby("Room name", as_index=False).resample('1s').ffill()
grouped.index = grouped.index.get_level_values(1)
result = grouped.groupby("Room name").resample("1h").mean()
which will look something like this:
Occupancy %
Room name Timestamp
Room 1 2020-09-01 08:00:00 0.000000
2020-09-01 09:00:00 38.194444
2020-09-01 10:00:00 100.000000
Room 2 2020-09-01 08:00:00 0.000000
2020-09-01 09:00:00 38.194444
2020-09-01 10:00:00 100.000000
(I just duplicated the data for Room 1 as Room 2, so the numbers are the same)
For a neat finish, we might unstack this multi-index, pivoting the room names into columns. Then convert those percentages into the nearest number of minutes.
Thus the whole solution is:
import pandas as pd

df = pd.read_excel("<your filename>", parse_dates=[1])
df2 = df.set_index("Timestamp")

# prepend some dummy rows for every different room name
for room_name in df2["Room name"].unique():
    df2.loc[pd.Timestamp("09.01.2020 08:00:00")] = (room_name, 0.0, None)
df2.sort_index(inplace=True)

grouped = df2.groupby("Room name", as_index=False).resample('1s').ffill()
grouped.index = grouped.index.droplevel(0)

result = (
    grouped
    .groupby("Room name")
    .resample("1h")
    .mean()
    .unstack(level=0)
    .div(100)     # % -> fraction
    .mul(60)      # fraction -> minutes
    .astype(int)  # nearest number of whole minutes
)
# no longer 'Occupancy %', so drop the label
result.columns = result.columns.droplevel(0)
yielding a result like
Room name Room 1 Room 2
Timestamp
2020-09-01 08:00:00 0 0
2020-09-01 09:00:00 22 22
2020-09-01 10:00:00 60 60
which hopefully is close to what you were after.
As a starting point:
SELECT
room_name, sum(start-stop)
FROM
room_table
WHERE
timestamp BETWEEN 'some_time' AND 'another_time'
GROUP BY
room_name
In the above, the SQL table is room_table. It also assumes the start and stop fields are time types. The some_time and another_time values are just placeholders for the time range you are interested in.
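The data in the question is an event log (a timestamped occupancy reading whenever something changes) rather than explicit start/stop pairs, so to use a query like the one above you would first need to derive intervals. Here is a minimal pandas sketch of that idea, reusing the df built earlier in this answer (column names Room name, Timestamp and Occupancy % come from that DataFrame; unlike the resample approach above, intervals spanning an hour boundary are not split here):
# Pair each occupancy reading with the next reading for the same room to get
# (start, stop) intervals, then keep only the occupied ones.
df = df.sort_values(["Room name", "Timestamp"])
df["stop"] = df.groupby("Room name")["Timestamp"].shift(-1)
occupied = (
    df[df["Occupancy %"] > 0]
    .dropna(subset=["stop"])
    .rename(columns={"Timestamp": "start"})
    .copy()
)
occupied["minutes"] = (occupied["stop"] - occupied["start"]).dt.total_seconds() / 60
# credit each interval's minutes to the hour in which it starts
print(occupied.groupby(["Room name", occupied["start"].dt.floor("H")])["minutes"].sum())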

Pandas: Group by, Cumsum + Shift with a "where clause"

I am attempting to learn how to do in pandas what I would otherwise do with SQL window functions.
Assume I have the following dataframe, which shows different players' previous matches and how many kills they got in each match.
date player kills
2019-01-01 a 15
2019-01-02 b 20
2019-01-03 a 10
2019-03-04 a 20
With the code below I managed to create a groupby that only shows the previously summed kills (the sum of the player's kills excluding the kills he got in the game of the current row).
df['sum_kills'] = df.groupby('player')['kills'].transform(lambda x: x.cumsum().shift())
This creates the following values:
date player kills sum_kills
2019-01-01 a 15 NaN
2019-01-02 b 20 NaN
2019-01-03 a 10 15
2019-03-04 a 20 25
However what I ideally want is the option to include a filter/where clause in the grouped values. So let's say I only wanted to get the summed values from the previous 30 days (1 month). Then my new dataframe should instead look like this:
date player kills sum_kills
2019-01-01 a 15 NaN
2019-01-02 b 20 NaN
2019-01-03 a 10 15
2019-03-04 a 20 NaN
The last row would have no summed kills, because player a had not played any games in the previous month. Is this possible somehow?
I think you are a bit in a pinch using groupby and transform. As explained here, transform operates on a single series, so you can't access the data of other columns.
groupby with apply does not seem to be the correct way either, because the custom function is expected to return an aggregated result for the group passed by groupby, but you want a different result for each row.
So the best solution I can propose is to use apply without groupby, and perform all the selection yourself inside the custom function:
def killcount(x, data, timewin):
    """Count the player's kills in a time window before the time of the current row.

    x: dataframe row
    data: full dataframe
    timewin: a pandas.Timedelta
    """
    return data.loc[(data['date'] < x['date'])               # dates preceding the current row
                    & (data['date'] >= x['date'] - timewin)  # dates within the time window
                    & (data['player'] == x['player'])        # rows for the same player
                    ]['kills'].sum()

df['sum_kills'] = df.apply(lambda r: killcount(r, df, pd.Timedelta(30, 'D')), axis=1)
This returns:
date player kills sum_kills
0 2019-01-01 a 15 0
1 2019-01-02 b 20 0
2 2019-01-03 a 10 15
3 2019-03-04 a 20 0
In case you haven't done so already, remember to parse the 'date' column to datetime type using pandas.to_datetime, otherwise you cannot perform the date comparisons.
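Depending on your pandas version, a time-based rolling window can also express the "previous 30 days" logic directly, as an alternative to the apply-based approach above. A minimal sketch (the column name sum_kills_30d and the merge back onto the original rows are my own additions; it assumes at most one match per player per day, otherwise the merge would duplicate rows):
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-01", "2019-01-02", "2019-01-03", "2019-03-04"]),
    "player": ["a", "b", "a", "a"],
    "kills": [15, 20, 10, 20],
})

# closed="left" makes each window cover the 30 days strictly before the current
# match, so the current row's kills are excluded; empty windows give NaN.
rolled = (
    df.sort_values("date")
      .set_index("date")
      .groupby("player")["kills"]
      .rolling("30D", closed="left")
      .sum()
      .rename("sum_kills_30d")
      .reset_index()
)
df = df.merge(rolled, on=["player", "date"], how="left")
print(df)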

pandas - efficiently computing minutely returns as columns on intraday data

I have a DataFrame that looks like such:
closingDate Time Last
0 1997-09-09 2018-12-13 00:00:00 1000
1 1997-09-09 2018-12-13 00:01:00 1002
2 1997-09-09 2018-12-13 00:02:00 1001
3 1997-09-09 2018-12-13 00:03:00 1005
I want to create a DataFrame with roughly 1440 columns labeled as timestamps, where the respective daily value is the return over the prior minute:
closingDate 00:00:00 00:01:00 00:02:00
0 1997-09-09 2018-12-13 -0.08 0.02 -0.001 ...
1 1997-09-10 2018-12-13 ...
My issue is that this is a very large DataFrame (several GB), and I need to do this operation multiple times. Time and memory efficiency are key, but time is more important. Is there some vectorized, built-in method to do this in pandas?
You can do this by grouping and shifting your time series, which should result in more efficient calculations.
First aggregate your data by closingDate.
g = df.groupby("closingDate")
Next you can shift your data within each group to offset by one row, i.e. the prior minute.
shifted = g.shift(periods=1)
This will create a new dataframe where the Last value is from the previous minute. Now you can join it to your original dataframe based on the index.
df = df.merge(shifted, left_index=True, right_index=True)
This adds the shifted columns to the new dataframe that you can use to do your difference calculation.
df["Diff"] = (df["Last_x"] - df["Last_y"]) / df["Last_y"]
You now have all the data you're looking for. If you need each minute to be its own column, you can pivot the results. By grouping on closingDate and then applying the shift, you avoid shifting values across days: if you look at the first observation of each day you'll get a NaN, since values are not carried over from the previous day.
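A sketch of that final pivot step (my own addition; it assumes one row per closingDate and minute, and uses the Time column as the intraday label):
# Label each row with its minute of day, then spread the per-minute returns
# into wide form: one row per closingDate, roughly 1440 minute columns.
df["minute"] = pd.to_datetime(df["Time"]).dt.strftime("%H:%M:%S")
returns_wide = df.pivot(index="closingDate", columns="minute", values="Diff")
print(returns_wide.head())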

pandas add new column based on iteration through rows

I have a list of transactions for a business.
Example dataframe:
userid date amt start_of_day_balance
123 2017-01-04 10 100.0
123 2017-01-05 20 NaN
123 2017-01-02 30 NaN
123 2017-01-04 40 100.0
The start of day balance is not always retrieved (in that case we receive a NaN). But from the moment we know the start of day balance for some day, we can accurately estimate the balance after each subsequent transaction.
In this example the new column should look as follows:
userid date amt start_of_day_balance calculated_balance
123 2017-01-04 10 100.0 110
123 2017-01-05 20 NaN 170
123 2017-01-02 30 NaN NaN
123 2017-01-04 40 100.0 150
Note that there is no way to tell the exact order of the transactions that occurred on the same day - I'm happy to overlook that in this case.
My question is how to create this new column. Something like:
df['calculated_balance'] = df.sort_values(['date']).groupby(['userid'])\
['amt'].cumsum() + df['start_of_day_balance'].min()
wouldn't work because of the NaNs.
I also don't want to filter out any transactions that happened before the first recorded start of day balance.
I came up with a solution that seems to work. I'm not sure how elegant it is.
def calc_estimated_balance(g):
    # find the first date which has a start of day balance
    first_date_with_bal = g.loc[g['start_of_day_balance'].first_valid_index(), 'date']
    # only calculate the balance if date is greater than or equal to the date of the first balance
    g['calculated_balance'] = (
        g[g['date'] >= first_date_with_bal]['amt'].cumsum()
         .add(g['start_of_day_balance'].min())
    )
    return g

df = df.sort_values(['date']).groupby(['userid']).apply(calc_estimated_balance)
