Upsampling and dividing data in pandas - python

I am trying to upsample a pandas datetime-indexed dataframe, so that resulting data is equally divided over the new entries.
For instance, let's say I have a dataframe which stores a cost each month, and I want to get a dataframe which summarizes the equivalent costs per day for each month:
df = (pd.DataFrame([[pd.to_datetime('2023-01-01'), 31],
[pd.to_datetime('2023-02-01'), 14]],
columns=['time', 'cost']
)
.set_index("time")
)
Daily costs are 1$ (or whatever currency you like) in January (31 spread over 31 days), and 0.5$ in February (14 spread over 28 days).
After a lot of struggle, I managed to obtain the next code snippet which seems to do what I want:
from datetime import datetime
from dateutil.relativedelta import relativedelta
# add a value to perform a correct resampling
df.loc[df.index.max() + relativedelta(months=1)] = 0
# forward-fill over the right scale
# then divide each entry per the number of rows in the month
df = (df
.resample('1d')
.ffill()
.iloc[:-1]
.groupby(lambda x: datetime(x.year, x.month, 1))
.transform(lambda x: (x / x.count()))
)
However, this is not entirely ok:
using transform forces me to have dataframes with a single column;
I need to hardcode my original frequency several times in different formats (when adding an extra value at the end of the dataframe, and in the groupby), which makes it hard to wrap in a function;
it only works with an evenly spaced datetime index (which is acceptable in my case);
it remains complex.
Does anyone have a suggestion to improve this code snippet?

What if we took df's month indices and expanded them into ranges of days, dividing each of df's values by the number of days in its month and assigning the result to each day, all with list comprehensions (edit: for equally distributed values per day):
import pandas as pd
# initial DataFrame
df = (pd.DataFrame([[pd.to_datetime('2023-01-01'), 31],
[pd.to_datetime('2023-02-01'), 14]],
columns=['time', 'cost']
).set_index("time"))
# reformat to months
df.index = df.index.strftime('%m-%Y')
df1 = pd.concat( # concatenate the resulted DataFrames into one
[pd.DataFrame( # make a DataFrame from a row in df
[v / pd.Period(i).days_in_month # each month's value divided by n of days in a month
for d in range(pd.Period(i).days_in_month)], # repeated for as many times as there are days
index=pd.date_range(start=i, periods=pd.Period(i).days_in_month, freq='D')) # days range
for i, v in df.iterrows()]) # for each df's index and value
df1
Output:
cost
2023-01-01 1.0
2023-01-02 1.0
2023-01-03 1.0
2023-01-04 1.0
2023-01-05 1.0
2023-01-06 1.0
2023-01-07 1.0
2023-01-08 1.0
2023-01-09 1.0
2023-01-10 1.0
2023-01-11 1.0
... ...
2023-02-13 0.5
2023-02-14 0.5
2023-02-15 0.5
2023-02-16 0.5
2023-02-17 0.5
2023-02-18 0.5
2023-02-19 0.5
2023-02-20 0.5
2023-02-21 0.5
2023-02-22 0.5
2023-02-23 0.5
2023-02-24 0.5
2023-02-25 0.5
2023-02-26 0.5
2023-02-27 0.5
2023-02-28 0.5
What could be done to avoid uniform distribution of daily costs and for the cases with multiple columns? Here's an extended df:
# additional columns and a row
df = (pd.DataFrame([[pd.to_datetime('2023-01-01'), 31, 62, 23],
[pd.to_datetime('2023-02-01'), 14, 28, 51],
[pd.to_datetime('2023-03-01'), 16, 33, 21]],
columns=['time', 'cost1', 'cost2', 'cost3']
).set_index("time"))
# reformat to months
df.index = df.index.strftime('%m-%Y')
df
Output:
cost1 cost2 cost3
time
01-2023 31 62 23
02-2023 14 28 51
03-2023 16 33 21
Here's what I came up with for the cases where monthly costs are upsampled into randomized daily costs, inspired by this question. This solution scales with the number of columns and rows:
import numpy as np
df1 = pd.concat( # concatenate the resulted DataFrames into one
[pd.DataFrame( # make a DataFrame from a row in df
# here we make a Series with random Dirichlet distributed numbers
# with length of a month and a column's value as the sum
[pd.Series((np.random.dirichlet(np.ones(pd.Period(i).days_in_month), size=1)*v
).flatten()) # the product is an ndarray that needs flattening
for v in row], # for every column value in a row
# index renamed as columns because of the created DataFrame's shape
index=df.columns
# transpose and set the proper index
).T.set_index(
pd.date_range(start=i,
periods=pd.Period(i).days_in_month,
freq='D'))
for i, row in df.iterrows()]) # iterate over every row
Output:
cost1 cost2 cost3
2023-01-01 1.703177 1.444117 0.160151
2023-01-02 0.920706 3.664460 0.823405
2023-01-03 1.210426 1.194963 0.294093
2023-01-04 0.214737 1.286273 0.923881
2023-01-05 1.264553 0.380062 0.062829
... ... ... ...
2023-03-27 0.124092 0.615885 0.251369
2023-03-28 0.520578 1.505830 1.632373
2023-03-29 0.245154 3.094078 0.308173
2023-03-30 0.530927 0.406665 1.149860
2023-03-31 0.276992 1.115308 0.432090
90 rows × 3 columns
To verify the monthly sums:
df1.groupby(pd.Grouper(freq='M')).agg('sum')
Output:
cost1 cost2 cost3
2023-01-31 31.0 62.0 23.0
2023-02-28 14.0 28.0 51.0
2023-03-31 16.0 33.0 21.0
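For the evenly divided case, a shorter route may be to reindex onto a daily range, forward-fill, and divide by the number of days in each row's month. This is only a sketch: it assumes every row is stamped on the first day of its month, and the helper name spread_over_days is made up for illustration. It handles several columns at once and needs no sentinel row or groupby.

```python
import pandas as pd

# toy monthly data, one row per month start, two cost columns
monthly = pd.DataFrame(
    {"cost1": [31, 14], "cost2": [62, 28]},
    index=pd.to_datetime(["2023-01-01", "2023-02-01"]),
)

def spread_over_days(monthly):
    """Split each month-start row evenly over the days of that month."""
    # extend the range to the last day of the final month
    last_day = monthly.index.max() + pd.offsets.MonthEnd(0)
    days = pd.date_range(monthly.index.min(), last_day, freq="D")
    # repeat each monthly value on every day of its month
    daily = monthly.reindex(days, method="ffill")
    # divide every row by the length of the month it belongs to
    return daily.div(days.days_in_month.to_numpy(), axis=0)

daily = spread_over_days(monthly)
```

Summing daily per month (for example with daily.groupby(daily.index.to_period("M")).sum()) should give back the original monthly values.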

Related

Create new Row in Data Frame with ID and date if ID and date do not exist in "x" timeframe [duplicate]

My data can have multiple events on a given date or NO events on a date. I take these events, get a count by date and plot them. However, when I plot them, my two series don't always match.
idx = pd.date_range(df['simpleDate'].min(), df['simpleDate'].max())
s = df.groupby(['simpleDate']).size()
In the above code idx becomes a range of, say, 30 dates: 09-01-2013 to 09-30-2013.
However, s may only have 25 or 26 days because no events happened on some dates. I then get an AssertionError because the sizes don't match when I try to plot:
fig, ax = plt.subplots()
ax.bar(idx.to_pydatetime(), s, color='green')
What's the proper way to tackle this? Should I remove dates with no values from idx, or (which I'd rather do) add the missing dates to the series with a count of 0? I'd rather have a full graph of 30 days with 0 values. If this approach is right, any suggestions on how to get started? Do I need some sort of dynamic reindex function?
Here's a snippet of S ( df.groupby(['simpleDate']).size() ), notice no entries for 04 and 05.
09-02-2013 2
09-03-2013 10
09-06-2013 5
09-07-2013 1
You could use Series.reindex:
import pandas as pd
idx = pd.date_range('09-01-2013', '09-30-2013')
s = pd.Series({'09-02-2013': 2,
'09-03-2013': 10,
'09-06-2013': 5,
'09-07-2013': 1})
s.index = pd.DatetimeIndex(s.index)
s = s.reindex(idx, fill_value=0)
print(s)
yields
2013-09-01 0
2013-09-02 2
2013-09-03 10
2013-09-04 0
2013-09-05 0
2013-09-06 5
2013-09-07 1
2013-09-08 0
...
A quicker workaround is to use .asfreq(). This doesn't require creation of a new index to call within .reindex().
# "broken" (staggered) dates
dates = pd.Index([pd.Timestamp('2012-05-01'),
pd.Timestamp('2012-05-04'),
pd.Timestamp('2012-05-06')])
s = pd.Series([1, 2, 3], dates)
print(s.asfreq('D'))
2012-05-01 1.0
2012-05-02 NaN
2012-05-03 NaN
2012-05-04 2.0
2012-05-05 NaN
2012-05-06 3.0
Freq: D, dtype: float64
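If zeros are wanted instead of NaN, asfreq also takes a fill_value argument. A small sketch using the same three dates as above (the variable name filled is just for illustration); note that asfreq only spans from the first to the last existing date, so it will not pad out to a wider range the way reindex can:

```python
import pandas as pd

dates = pd.to_datetime(["2012-05-01", "2012-05-04", "2012-05-06"])
s = pd.Series([1, 2, 3], index=dates)

# insert the missing days and give them a count of 0 instead of NaN
filled = s.asfreq("D", fill_value=0)
print(filled)
```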
One issue is that reindex will fail if the index contains duplicate values. Say we're working with timestamped data, which we want to index by date:
df = pd.DataFrame({
'timestamps': pd.to_datetime(
['2016-11-15 1:00','2016-11-16 2:00','2016-11-16 3:00','2016-11-18 4:00']),
'values':['a','b','c','d']})
df.index = pd.DatetimeIndex(df['timestamps']).floor('D')
df
yields
timestamps values
2016-11-15 "2016-11-15 01:00:00" a
2016-11-16 "2016-11-16 02:00:00" b
2016-11-16 "2016-11-16 03:00:00" c
2016-11-18 "2016-11-18 04:00:00" d
Due to the duplicate 2016-11-16 date, an attempt to reindex:
all_days = pd.date_range(df.index.min(), df.index.max(), freq='D')
df.reindex(all_days)
fails with:
...
ValueError: cannot reindex from a duplicate axis
(by this it means the index has duplicates, not that it is itself a dup)
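Instead, in pandas versions before 1.0 we could use .loc to look up entries for all dates in range (since pandas 1.0, .loc raises a KeyError when any requested label is missing from the index):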
df.loc[all_days]
yields
timestamps values
2016-11-15 "2016-11-15 01:00:00" a
2016-11-16 "2016-11-16 02:00:00" b
2016-11-16 "2016-11-16 03:00:00" c
2016-11-17 NaN NaN
2016-11-18 "2016-11-18 04:00:00" d
fillna can be used on the column series to fill blanks if needed.
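On pandas 1.0 and later, where .loc raises a KeyError for labels missing from the index, a left join from the full date range is one way to get a comparable result while keeping the duplicate dates. A minimal sketch on the same toy data; the index name "day" and the variable filled are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "timestamps": pd.to_datetime(
        ["2016-11-15 1:00", "2016-11-16 2:00", "2016-11-16 3:00", "2016-11-18 4:00"]),
    "values": ["a", "b", "c", "d"],
})
# index by calendar day; the index is renamed so it does not clash with the column
df.index = pd.DatetimeIndex(df["timestamps"]).floor("D").rename("day")

all_days = pd.date_range(df.index.min(), df.index.max(), freq="D")

# left join from an empty frame that carries every day in the range:
# duplicate days are kept, missing days show up as NaN/NaT rows
filled = pd.DataFrame(index=all_days).join(df, how="left")
print(filled)
```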
An alternative approach is resample, which can handle duplicate dates in addition to missing dates. For example:
df.resample('D').mean()
resample is a deferred operation like groupby so you need to follow it with another operation. In this case mean works well, but you can also use many other pandas methods like max, sum, etc.
Here is the original data, but with an extra entry for '2013-09-03':
val
date
2013-09-02 2
2013-09-03 10
2013-09-03 20 <- duplicate date added to OP's data
2013-09-06 5
2013-09-07 1
And here are the results:
val
date
2013-09-02 2.0
2013-09-03 15.0 <- mean of original values for 2013-09-03
2013-09-04 NaN <- NaN b/c date not present in orig
2013-09-05 NaN <- NaN b/c date not present in orig
2013-09-06 5.0
2013-09-07 1.0
I left the missing dates as NaNs to make it clear how this works, but you can add fillna(0) to replace NaNs with zeroes as requested by the OP or alternatively use something like interpolate() to fill with non-zero values based on the neighboring rows.
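For the original question (a count of events per date), resample can also produce the counts directly, with 0 on days that had no events. This is a sketch with made-up event rows; the column name event and the September date range are assumptions taken from the question's description:

```python
import pandas as pd

# toy event log: two events on 09-02, one on 09-03, one on 09-06
events = pd.DataFrame({
    "simpleDate": pd.to_datetime(
        ["2013-09-02", "2013-09-02", "2013-09-03", "2013-09-06"]),
    "event": ["a", "b", "c", "d"],
})

# count rows per calendar day; days without events get a count of 0
counts = events.set_index("simpleDate").resample("D").size()

# optionally pad out to a fixed plotting range, again filling with 0
full_month = counts.reindex(pd.date_range("2013-09-01", "2013-09-30"), fill_value=0)
```

Because full_month has exactly one entry per day of the range, it lines up with the date index when plotting.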
Here's a nice method to fill in missing dates into a dataframe, with your choice of fill_value, days_back to fill in, and sort order (date_order) by which to sort the dataframe:
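from datetime import datetime, timedelta
import pandas as pd

def fill_in_missing_dates(df, date_col_name='date', date_order='asc', fill_value=0, days_back=30):
    # index a copy by date so the caller's frame is not modified
    df = df.set_index(date_col_name, drop=True)
    df.index = pd.DatetimeIndex(df.index)
    # daily range covering the last `days_back` days up to today
    d = datetime.now().date()
    d2 = d - timedelta(days=days_back)
    idx = pd.date_range(d2, d, freq="D")
    df = df.reindex(idx, fill_value=fill_value)
    # apply the requested sort order ('asc', anything else sorts descending)
    df = df.sort_index(ascending=(date_order == 'asc'))
    df[date_col_name] = pd.DatetimeIndex(df.index)
    return df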
You can always just use DataFrame.merge() utilizing a left join from an 'All Dates' DataFrame to the 'Missing Dates' DataFrame. Example below.
# example DataFrame with missing dates between min(date) and max(date)
missing_df = pd.DataFrame({
'date':pd.to_datetime([
'2022-02-10'
,'2022-02-11'
,'2022-02-14'
,'2022-02-14'
,'2022-02-24'
,'2022-02-16'
])
,'value':[10,20,5,10,15,30]
})
# first create a DataFrame with all dates between specified start<-->end using pd.date_range()
all_dates = pd.DataFrame(pd.date_range(missing_df['date'].min(), missing_df['date'].max()), columns=['date'])
# from the all_dates DataFrame, left join onto the DataFrame with missing dates
new_df = all_dates.merge(right=missing_df, how='left', on='date')
s.asfreq('D').interpolate().asfreq('Q')

Pandas: Annualized Returns

I have a dataframe with quarterly returns of financial entities and I want to calculate 1-, 3-, 5- and 10-year annualized returns. The formula for calculating annualized returns is:
R = product(1+r)^(4/N) -1
where r is the quarterly return of an entity and N is the number of quarters.
For example, the 3-year annualized return is:
R_3yr = product(1+r)^(4/12) -1 = ((1+r1)*(1+r2)*(1+r3)*...*(1+r12))^(1/3) -1
r1, r2, r3 ... r12 are the quarterly returns of the previous 11 quarters plus current quarter.
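To make the formula concrete, here is a small sketch that applies it to four quarterly returns (the first four r_qrt values of entity A from the sample output in this question). The function name annualize_quarterly is made up for illustration:

```python
import numpy as np

def annualize_quarterly(quarterly_returns):
    """R = product(1 + r)^(4/N) - 1 for N quarterly returns."""
    r = np.asarray(quarterly_returns, dtype=float)
    return np.prod(1 + r) ** (4 / len(r)) - 1

# four quarters, so the exponent 4/N is 1 and R is the plain compounded return
one_year = annualize_quarterly([0.035719, 0.031417, 0.030872, 0.029147])
print(round(one_year, 6))
```

With four quarters this should come out close to the 0.133335 reported for entity A at 2015-12-31.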
I wrote code that provides the right results, but it is very slow because it loops through each row of the dataframe. The code below is an extract covering the 1-year and 3-year annualized returns (I applied the same concept for 5-, 7-, 10-, 15- and 20-year returns). r_qrt is the field with the quarterly returns.
import pandas as pd
import numpy as np
# create the dataframe where I append the results
df_final = pd.DataFrame()
columns=['Date','Entity','r_qrt','R_1yr','R_3yr']
# loop through the dataframe
for row in df.itertuples():
    R_1yr=np.nan #1-year annualized return
    R_3yr=np.nan #3-year annualized return
    #Calculate 1 YR Annualized Return
    date_previous_period=row.Date+ pd.DateOffset(years=-1)
    temp_table=df.loc[(df['Date']>date_previous_period) &
                      (df['Date']<=row.Date) &
                      (df['Entity']==row.Entity)]
    if temp_table['r_qrt'].count()>=4:
        b=(1+(temp_table.r_qrt))[-4:].product()
        R_1yr=(b-1)
    #Calculate 3 YR Annualized Return
    date_previous_period=row.Date+ pd.DateOffset(years=-3)
    temp_table=df.loc[(df['Date']>date_previous_period) &
                      (df['Date']<=row.Date) &
                      (df['Entity']==row.Entity)]
    if temp_table['r_qrt'].count()>=12:
        b=(1+(temp_table.r_qrt))[-12:].product()
        R_3yr=((b**(1/3))-1)
    d=[row.Date,row.Entity,row.r_qrt,R_1yr,R_3yr]
    # note: DataFrame.append was removed in pandas 2.0, so this line needs pandas < 2.0
    df_final = df_final.append(pd.Series(d, index=columns), ignore_index=True)
df_final looks as below (only reporting 1-year return results for space limitations)
Date        Entity  r_qrt     R_1yr
2015-03-31  A       0.035719  NaN
2015-06-30  A       0.031417  NaN
2015-09-30  A       0.030872  NaN
2015-12-31  A       0.029147  0.133335
2016-03-31  A       0.022100  0.118432
2016-06-30  A       0.020329  0.106408
2016-09-30  A       0.017676  0.092245
2016-12-31  A       0.017304  0.079676
2015-03-31  B       0.034705  NaN
2015-06-30  B       0.037772  NaN
2015-09-30  B       0.036726  NaN
2015-12-31  B       0.031889  0.148724
2016-03-31  B       0.029567  0.143020
2016-06-30  B       0.028958  0.133312
2016-09-30  B       0.028890  0.124746
2016-12-31  B       0.030389  0.123110
I am sure there is a more efficient way to run the same calculations but I have not been able to find it. My code is not efficient and takes more than 2 hours for large dataframes with long time series and many entities.
Thanks
see (https://www.investopedia.com/terms/a/annualized-total-return.asp) for the definition of annualized return
data=[ 3, 7, 5, 12, 1]
def annualize_rate(data):
    retVal=0
    accum=1
    for item in data:
        print(1+(item/100))
        accum*=1+(item/100)
    retVal=pow(accum,1/len(data))-1
    return retVal
print(annualize_rate(data))
output
0.05533402290765199
2015 (a and b)
data=[0.133335,0.148724]
print(annualize_rate(data))
output:
0.001410292043902306
2016 (a&b)
data=[0.079676,0.123110]
print(annualize_rate(data))
output
0.0010139064424810051
You can store each year's annualized value, then use pct_change to get a 3-year result:
data=[0.05,0.06,0.07]
df=pd.DataFrame({'Annualized':data})
df['Percent_Change']=df['Annualized'].pct_change().fillna(0)
amount=1
returns_plus_one=df['Percent_Change']+1
cumulative_return = returns_plus_one.cumprod()
df['Cumulative']=cumulative_return.mul(amount)
df['2item']=df['Cumulative'].rolling(window=2).mean()
df['2item'].plot()
print(df)
For future reference of other users, this is the new version of the code that I implemented following Golden Lion's suggestion:
def compoundfunct(arr):
    # np.prod is used because the np.product alias was removed in NumPy 2.0
    return np.prod(1+arr)**(4/len(arr)) - 1
# 1-yr annualized return
df["R_1Yr"]=df.groupby('Entity')['r_qrt'].rolling(4).apply(compoundfunct).groupby('Entity').shift(0).reset_index().set_index('level_1').drop('Entity',axis=1)
# 3-yr annualized return
df["R_3Yr"]=df.groupby('Entity')['r_qrt'].rolling(12).apply(compoundfunct).groupby('Entity').shift(0).reset_index().set_index('level_1').drop('Entity',axis=1)
The performance of the previous code was 36.4 sec for a dataframe of 5,640 rows. The new code is more than 10x faster, it took 2.8 sec
One of the issues with this new code is that one has to make sure that rows are sorted by group (Entity in my case) and date before running the calculations, otherwise results could be wrong.
Thanks,
S.
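A further speed-up may be possible by avoiding rolling.apply altogether: the product of (1 + r) over a window equals the exponential of the rolling sum of log(1 + r), and rolling sums are computed natively. The sketch below is not the code from this thread; the function name annualized_return and the demo frame are illustrative, and it assumes every 1 + r is positive and that the frame has a unique index. It sorts by Entity and Date internally, which also addresses the ordering caveat.

```python
import numpy as np
import pandas as pd

def annualized_return(df, quarters):
    """Rolling annualized return per Entity over `quarters` quarterly returns."""
    ordered = df.sort_values(["Entity", "Date"])
    # log(1 + r): a product over a window becomes a sum of logs
    log_growth = np.log1p(ordered["r_qrt"])
    rolled = (log_growth.groupby(ordered["Entity"])
                        .rolling(quarters)
                        .sum()
                        .droplevel(0))  # drop the Entity level, keep the row labels
    # exp(sum * 4/N) - 1 == product(1 + r)^(4/N) - 1
    return np.expm1(rolled * 4 / quarters).reindex(df.index)

quarter_ends = pd.to_datetime(
    ["2015-03-31", "2015-06-30", "2015-09-30", "2015-12-31", "2016-03-31"])
demo = pd.DataFrame({
    "Date": list(quarter_ends) + list(quarter_ends[:4]),
    "Entity": ["A"] * 5 + ["B"] * 4,
    "r_qrt": [0.035719, 0.031417, 0.030872, 0.029147, 0.022100,
              0.034705, 0.037772, 0.036726, 0.031889],
})
demo["R_1yr"] = annualized_return(demo, 4)
print(demo)
```

Rows with fewer than the required number of quarters for their entity stay NaN, as in the looped version.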

Python: Compare data against the 95th percentile of a running window dataset

I have a large DataFrame of thousands of rows but only 2 columns. The 2 columns are of the below format:
Dt          Val
2020-01-01  10.5
2020-01-01  11.2
2020-01-01  10.9
2020-01-03  11.3
2020-01-05  12.0
The first column is date and the second column is a value. For each date, there may be zero, one or more values.
What I need to do is the following: compute the 95th percentile based on the 30 days that have just passed and see if the current value is above or below that 95th percentile value. There must, however, be a minimum of 50 values available for the past 30 days.
For example, if a record has date "2020-12-01" and value "10.5", then I need to first see how many values are there available for the date range 2020-11-01 to 2020-11-30. If there are at least 50 values available over that date range, then I will want to compute the 95th percentile of those values and compare 10.5 against that. If 10.5 is greater than the 95th percentile value, then the result for that record is "Above Threshold". If 10.5 is less than the 95th percentile value, then the result for that record is "Below Threshold". If there are less than 50 values over the date range 2020-11-01 to 2020-11-30, then the result for that record is "Insufficient Data".
I would like to avoid running a loop if possible as it may be expensive from a resource and time perspective to loop through thousands of records to process them one by one. I hope someone can advise of a simple(r) python / pandas solution here.
Use rolling on a DatetimeIndex to get the number of values available and the 95th percentile in the last 30 days. Here is an example with a 3-day rolling window and the 0.9 quantile (for the question's case, use '30D' and 0.95):
import datetime
import pandas as pd
df = pd.DataFrame({'val':[1,2,3,4,5,6]},
index = [datetime.date(2020,10,1), datetime.date(2020,10,1), datetime.date(2020,10,2),
datetime.date(2020,10,3), datetime.date(2020,10,3), datetime.date(2020,10,4)])
df.index = pd.DatetimeIndex(df.index)
df['number_of_values'] = df['val'].rolling('3D').count()
df['rolling_percentile'] = df.rolling('3D')['val'].quantile(0.9, interpolation='nearest')
Then you can simply do your comparison:
# Above Threshold
(df['val']>df['rolling_percentile'])&(df['number_of_values']>=50)
# Below Threshold
(df['val']<=df['rolling_percentile'])&(df['number_of_values']>=50)
# Insufficient Data
df['number_of_values']<50
To exclude the current date, the closed argument would not work when a day has more than one row, so one option is a rolling apply:
import numpy as np
def f(x, metric):
    # drop every row that shares the date of the window's last row
    x = x[x.index!=x.index[-1]]
    if metric == 'count':
        return len(x)
    elif metric == 'percentile':
        return x.quantile(0.9, interpolation='nearest')
    else:
        return np.nan
df = pd.DataFrame({'val':[1,2,3,4,5,6]},
index = [datetime.date(2020,10,1), datetime.date(2020,10,1), datetime.date(2020,10,2),
datetime.date(2020,10,3), datetime.date(2020,10,3), datetime.date(2020,10,4)])
df.index = pd.DatetimeIndex(df.index)
df['count'] = df.rolling('3D')['val'].apply(f, args = ('count',))
df['percentile'] = df.rolling('3D')['val'].apply(f, args = ('percentile',))
val count percentile
2020-10-01 1 0.0 NaN
2020-10-01 2 0.0 NaN
2020-10-02 3 2.0 2.0
2020-10-03 4 3.0 3.0
2020-10-03 5 3.0 3.0
2020-10-04 6 3.0 5.0
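Putting the pieces together, the three labels from the question could be produced with np.select. This is a sketch under a few assumptions: the index holds dates normalized to midnight and is sorted, the names classify, before_current_day and prior_quantile are made up for illustration, and the window is '31D' because dropping the current day from a 31-day window leaves the 30 previous days. The demo call uses a short window and a low minimum count only so the toy data produces every label.

```python
import numpy as np
import pandas as pd

def before_current_day(window):
    """Drop the rows that share the date of the window's last row."""
    return window[window.index != window.index[-1]]

def prior_quantile(window, q):
    prior = before_current_day(window)
    return prior.quantile(q) if len(prior) else np.nan

def classify(df, window="31D", q=0.95, min_count=50):
    """Label each row against the percentile of the days before it."""
    rolling = df["val"].rolling(window)
    count = rolling.apply(lambda w: len(before_current_day(w)), raw=False)
    pct = rolling.apply(prior_quantile, raw=False, args=(q,))
    labels = np.select(
        [count < min_count, df["val"] > pct],
        ["Insufficient Data", "Above Threshold"],
        default="Below Threshold",
    )
    return pd.Series(labels, index=df.index, name="result")

demo = pd.DataFrame(
    {"val": [1, 2, 3, 4, 5, 4]},
    index=pd.to_datetime(["2020-10-01", "2020-10-01", "2020-10-02",
                          "2020-10-03", "2020-10-03", "2020-10-04"]),
)
result = classify(demo, window="3D", q=0.9, min_count=2)
print(result)
```

Note that rolling.apply with raw=False still calls a Python function per row, so this removes the explicit loop but is not fully vectorized.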

Boxplot of Multiindex df

I want to do 2 things:
I want to create one boxplot per date/day with all the values for MeanTravelTimeSeconds in that date. The number of MeanTravelTimeSeconds elements varies from date to date (e.g. one day might have a count of 300 values while another, 400).
Also, I want to transform the rows in my multiindex series into columns because I don't want the rows to repeat every time. If it stays like this I'd have tens of millions of unnecessary rows.
Here is the resulting series after using df.stack() on a df indexed by date (date is a datetime object index):
Date
2016-01-02 NumericIndex 1611664
OriginMovementID 4744
DestinationMovementID 5084
MeanTravelTimeSeconds 1233
RangeLowerBoundTravelTimeSeconds 756
...
2020-03-31 DestinationMovementID 3594
MeanTravelTimeSeconds 1778
RangeLowerBoundTravelTimeSeconds 1601
RangeUpperBoundTravelTimeSeconds 1973
DayOfWeek Tuesday
Length: 11281655, dtype: object
When I use seaborn to plot the boxplot I get a bunch of errors after playing with different selections.
If I try to do df.stack().unstack() or df.stack().T I get the following error:
Index contains duplicate entries, cannot reshape
How do I plot the boxplot and how do I turn the rows into columns?
You really do need to make your index unique to make the functions you want to work. I suggest a sequential number that resets at every change in the other two key columns.
import datetime as dt
import random
import numpy as np
import pandas as pd
cat = ["NumericIndex","OriginMovementID","DestinationMovementID","MeanTravelTimeSeconds",
"RangeLowerBoundTravelTimeSeconds"]
df = pd.DataFrame(
[{"Date":d, "Observation":cat[random.randint(0,len(cat)-1)],
"Value":random.randint(1000,10000)}
for i in range(random.randint(5,20))
for d in pd.date_range(dt.datetime(2016,1,2), dt.datetime(2016,3,31), freq="14D")])
# starting point....
df = df.sort_values(["Date","Observation"]).set_index(["Date","Observation"])
# generate an array that is sequential within change of key
seq = np.full(df.index.shape, 0)
s=0
p=""
for i, v in enumerate(df.index):
    if i==0 or p!=v: s=0
    else: s+=1
    seq[i] = s
    p=v
df["SeqNo"] = seq
# add to index - now unstack works as required
dfdd = df.set_index(["SeqNo"], append=True)
dfdd.unstack(0).loc["MeanTravelTimeSeconds"].boxplot()
print(dfdd.unstack(1).head().to_string())
output
Value
Observation DestinationMovementID MeanTravelTimeSeconds NumericIndex OriginMovementID RangeLowerBoundTravelTimeSeconds
Date SeqNo
2016-01-02 0 NaN NaN 2560.0 5324.0 5085.0
1 NaN NaN 1066.0 7372.0 NaN
2016-01-16 0 NaN 6226.0 NaN 7832.0 NaN
1 NaN 1384.0 NaN 8839.0 NaN
2 NaN 7892.0 NaN NaN NaN
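The sequence number can also be built without an explicit loop, using groupby(...).cumcount(). A minimal sketch on made-up values (the frame name long_df and the numbers are illustrative, not the poster's data):

```python
import pandas as pd

# toy long-format data with repeated (Date, Observation) pairs
long_df = pd.DataFrame({
    "Date": pd.to_datetime(["2016-01-02"] * 3 + ["2016-01-16"] * 2),
    "Observation": ["MeanTravelTimeSeconds", "MeanTravelTimeSeconds", "NumericIndex",
                    "MeanTravelTimeSeconds", "MeanTravelTimeSeconds"],
    "Value": [1233, 1500, 7, 1778, 1600],
}).set_index(["Date", "Observation"])

# number the repeats within each (Date, Observation) pair: 0, 1, 2, ...
long_df["SeqNo"] = long_df.groupby(level=["Date", "Observation"]).cumcount().to_numpy()
values = long_df.set_index("SeqNo", append=True)["Value"]

# the index is now unique, so the observations can become columns
wide = values.unstack("Observation")

# one column per date for a single observation type, ready for .boxplot()
per_date = values.xs("MeanTravelTimeSeconds", level="Observation").unstack("Date")
```

Calling per_date.boxplot() would then draw one box per date, and dates with fewer values simply carry NaN in the extra rows.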

Pandas changing cell values based on another cell

I am currently formatting data from two different data sets.
One of the datasets reflects an observation count of people in a room on an hourly basis; the second one is a count of people based on wifi logs generated at 5-minute intervals.
After merging these two dataframes into one, I run into the issue where each hourly row (such as "10:00:00") has the data from the original set, but the other rows (every 5 minutes, like "10:47:14") do not include this data.
Here is how the merge dataframe looks:
room time con auth capacity % Count module size
0 B002 Mon Nov 02 10:32:06 23 23 90 NaN NaN NaN NaN
1 B002 Mon Nov 02 10:37:10 25 25 90 NaN NaN NaN NaN
12527 B002 Mon Nov 02 10:00:00 NaN NaN 90 50% 45.0 COMP30520 60
12528 B002 Mon Nov 02 11:00:00 NaN NaN 90 0% 0.0 COMP30520 60
Is there a way for me to go through the dataframe and find all the information regarding the "occupancy", "occupancyCount", "module" and "size" from 11:00:00 and write it to all the cells that are of the same day and where the hour is between 10:00:00 and 10:59:59?
That would allow me to have all the information on each row and then allow me to gather the min(), max() and median() based on 'day' and 'hour'.
To answer the comment asking for the original dataframes, here they are:
first dataframe:
time room module size
0 Mon Nov 02 09:00:00 B002 COMP30190 29
1 Mon Nov 02 10:00:00 B002 COMP40660 53
second dataframe:
room time con auth capacity % Count
0 B002 Mon Nov 02 20:32:06 0 0 NaN NaN NaN
1 B002 Mon Nov 02 20:37:10 0 0 NaN NaN NaN
2 B002 Mon Nov 02 20:42:12 0 0 NaN NaN NaN
12797 B008 Wed Nov 11 13:00:00 NaN NaN 40 25 10.0
12798 B008 Wed Nov 11 14:00:00 NaN NaN 40 50 20.0
12799 B008 Wed Nov 11 15:00:00 NaN NaN 40 25 10.0
this is how these two dataframes were merged together:
DFinal = pd.merge(DF, d3, left_on=["room", "time"], right_on=["room", "time"], how="outer", left_index=False, right_index=False)
Any help with this would be greatly appreciated.
Thanks a lot,
-Romain
Somewhere to start:
b = df[(df['time'] > X) & (df['time'] < Y)]
selects all the elements within times X and Y
And then
df.loc[df['column_name'].isin(b)]
Gives you the rows you want (ie - between X and Y) and you can just assign as you see fit.
I think you'll want to assign the values of the selected rows to those of row number X?
Hope that helps.
Note that these functions are cut and paste jobs from
[1] Filter dataframe rows if value in column is in a set list of values
[2] Select rows from a DataFrame based on values in a column in pandas
If I understood it correctly, you want to fill all the missing values in your merged dataframe with the corresponding closest data point available in the given hour. I did something similar in essence in the past using a variant of pandas.cut for timeseries, but I can't seem to find it, and it wasn't really nice anyway.
While I'm not entirely sure, fillna method of the pandas dataframe might be what you want (docs here).
Let your two dataframes be named df_hour and df_cinq, you merged them like this:
df = pd.merge(df_hour, df_cinq, left_on=["room", "time"], right_on=["room", "time"], how="outer", left_index=False, right_index=False)
Then you change your index to time and sort it:
df.set_index('time',inplace=True)
df.sort_index(inplace=True)
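The fillna method has an option called 'method' that can take these values (newer pandas versions deprecate this argument in favour of the dedicated df.ffill() and df.bfill() methods; 'nearest' is only available on reindex, not on fillna):
Method Action
pad / ffill Fill values forward
bfill / backfill Fill values backward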
Using it to do forward filling (i.e. missing values are filled with the preceding value in the frame):
df.fillna(method='ffill', inplace=True)
The problem with this on your data is that all of the missing data in the non-working hours belonging to the 5-minute observations will be filled with outdated data points. You can use the limit option to limit the amount of consecutive data points to be filled but I don't know if it's useful to you.
Here's a complete script I wrote as a toy example:
import pandas as pd
import random
hourly_count = 8 #workhours
cinq_count = 24 * 12 # 1day
hour_rng = pd.date_range('1/1/2016-09:00:00', periods = hourly_count, freq='H')
cinq_rng = pd.date_range('1/1/2016-00:02:53', periods = cinq_count,
freq='5min')
roomz = 'room0 room1 secretroom'.split()
hourlydata = {'col1': [], 'col2': [], 'room': []}
for i in range(hourly_count):
    hourlydata['room'].append(random.choice(roomz))
    hourlydata['col1'].append(random.random())
    hourlydata['col2'].append(random.randint(0,100))
cinqdata = {'col3': [], 'col4': [], 'room': []}
frts = 'apples oranges peaches grapefruits whatmore'.split()
vgtbls = 'onion1 onion2 onion3 onion4 onion5 onion0'.split()
for i in range(cinq_count):
    cinqdata['room'].append(random.choice(roomz))
    cinqdata['col3'].append(random.choice(frts))
    cinqdata['col4'].append(random.choice(vgtbls))
hourlydf = pd.DataFrame(hourlydata)
hourlydf['time'] = hour_rng
cinqdf = pd.DataFrame(cinqdata)
cinqdf['time'] = cinq_rng
df = pd.merge(hourlydf, cinqdf, left_on=['room','time'], right_on=['room',
'time'], how='outer', left_index=False, right_index=False)
df.set_index('time',inplace=True)
df.sort_index(inplace=True)
df.fillna(method='ffill', inplace=True)
print(df['2016-1-1 09:00:00':'2016-1-1 17:00:00'])
Actually I was able to fix this by:
First: using partition on "time" feature in order to generate two additional columns, one for the day showed in "time" and one for the hour in the "time" column.
I used the lambda functions to get these columns:
df['date'] = df['date'].map(lambda x: x[10:-6])
df['time'] = df['time'].map(lambda x: x[8:-8])
Based on these two new columns I modified the way the dataframes were being merged.
here is the code I used to fix it:
dataframeFinal = pd.merge(dataframe1, dataframe2, left_on=["room", "date", "hour"],
right_on=["room", "date", "hour"], how="outer",
left_index=False, right_index=False, copy=False)
After this merge I ended up having duplicate time columns ("time_y" and "time_x").
So I replaced the NaN values as follows:
dataframeFinal.time_y.fillna(dataframeFinal.time_x, inplace=True)
Now the column "time_y" contains all the time values, no more NaN.
I do not need the "time_x" column so I drop it from the dataframe
dataframeFinal = dataframeFinal.drop('time_x', axis=1)
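If the time column is parsed into real datetimes first, the same idea (merge on room plus the hour) can be written without string slicing by flooring each 5-minute timestamp to its hour. This is a sketch on invented rows: the frame names hourly and wifi, the year, and the values are assumptions for illustration only.

```python
import pandas as pd

# hourly observations: one row per room and hour
hourly = pd.DataFrame({
    "room": ["B002", "B002"],
    "time": pd.to_datetime(["2015-11-02 10:00:00", "2015-11-02 11:00:00"]),
    "module": ["COMP30520", "COMP40660"],
})
# wifi log rows arriving roughly every 5 minutes
wifi = pd.DataFrame({
    "room": ["B002", "B002", "B002"],
    "time": pd.to_datetime(["2015-11-02 10:32:06", "2015-11-02 10:37:10",
                            "2015-11-02 11:02:00"]),
    "auth": [23, 25, 30],
})

# the hour each log row falls into, e.g. 10:32:06 -> 10:00:00
wifi["hour"] = wifi["time"].dt.floor("60min")

# attach the hourly information to every log row of the same room and hour
merged = wifi.merge(hourly.rename(columns={"time": "hour"}),
                    on=["room", "hour"], how="left")
print(merged)
```

Each log row keeps its own timestamp in time, so there is no time_x / time_y pair to clean up, and grouping by room and hour for min(), max() and median() works directly on merged.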
