I'm attempting to get the mean values on one data frame between certain time points that are marked as events in a second data frame.
This is a follow-up to this question, where I now have missing/NaN values: Find a subset of columns based on another dataframe?
import pandas as pd
import numpy as np
#example
example_g = [["4/20/21 4:20", 302, 0, np.NaN, np.NaN, np.NaN, np.NaN, np.NaN],
["2/17/21 9:20",135, 1, 1.4, 1.8, 2, 8, 10],
["2/17/21 9:20", 111, 4, 5, 5.1, 5.2, 5.3, 5.4]]
example_g_table = pd.DataFrame(example_g,columns=['Date_Time','CID', 0.0, 0.1, 0.2, 0.3, 0.4, 0.5])
#Example Timestamps
example_s = [["4/20/21 4:20",302,0.0, 0.2, np.NaN],
["2/17/21 9:20",135,0.0, 0.1, 0.4 ],
["2/17/21 9:20",111,0.3, 0.4, 0.5 ]]
example_s_table = pd.DataFrame(example_s,columns=['Date_Time','CID', "event_1", "event_2", "event_3"])
df = pd.merge(left=example_g_table,right=example_s_table,on=['Date_Time','CID'],how='left')
def func(df):
    event_2 = df['event_2']
    event_3 = df['event_3']
    start = event_2 + 2  # assumes the column named 0 is the third column (positional index 2), column 1 the fourth, and so on
    end = event_3 + 2    # same offset as above
    total = sum(df.iloc[start:end+1])  # the key line: sums the values of the columns in the range start to end
    avg = total/(end-start+1)  # (end-start+1) is the number of values in the range
    return avg
df['avg'] = df.apply(func,axis=1)
I get the following error:
cannot do positional indexing on Index with these indexers [nan] of type float
I have attempted making sure that columns are floats and have tried removing the int() command within the definitions of the events.
How can I perform the same calculations as before where possible, while skipping any values that are NaN?
Regarding your question, check whether this solution works for you:
def func(row):
    try:
        event_2 = row['event_2']
        event_3 = row['event_3']
        start = int(event_2 + 2)
        end = int(event_3 + 2) + 1
        list_row = row.tolist()[start:end]
        list_row = [x for x in list_row if x == x]  # drop NaN values (NaN != NaN)
        return sum(list_row)/(end-start)
    except Exception:
        return np.NaN
df['avg'] = df.apply(lambda x: func(x),axis=1)
I simplified the function and convert the start and end parameters to integers before taking the subset. When applying the function over the rows I use a lambda, and in the average calculation I remove all NaN values first (the x == x check is False only for NaN).
You can check if the event values are NaN; if any of them is NaN, just return NaN from the function, else return the required value.
You can also modify the function a bit to calculate the values between any two given events, i.e. not necessarily event 2 and event 3. Also, the data you provided in the previous question had integer column names for the event values, but this time you have float values like 0.1, 0.2, 0.3, etc. You can store the event-value columns in a list in increasing order, so that they can be accessed via the index values coming from the event columns of the second dataframe.
Additionally, you can directly use np.mean instead of calculating the sum and dividing it manually. The modified version of the function will look like this:
eventCols = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]  # columns holding the values for the events
def getMeanValue(row, eN1=2, eN2=3):
    if pd.isna([row[f'event_{eN1}'], row[f'event_{eN2}']]).any():
        return float('nan')
    else:
        requiredEventCols = eventCols[int(row[f'event_{eN1}']):int(row[f'event_{eN2}'] + 1)]
        return np.mean(row[requiredEventCols])
Now, you can apply this function to the dataframe with axis=1:
df['avg'] = df.apply(getMeanValue,axis=1)
Date_Time CID 0.0 0.1 0.2 ... 0.5 event_1 event_2 event_3 avg
0 4/20/21 4:20 302 0 NaN NaN ... NaN 0 2 NaN NaN
1 2/17/21 9:20 135 1 1.4 1.8 ... 10.0 0 1 4.0 3.30
2 2/17/21 9:20 111 4 5.0 5.1 ... 5.4 3 4 5.0 5.35
[3 rows x 12 columns]
Additionally, if needed, you can also pass the two event numbers; the default values are 2 and 3, which means the value will be calculated between event_2 and event_3.
Average between event_1 and event_2:
df['avg'] = df.apply(getMeanValue,axis=1, eN1=1, eN2=2)
Date_Time CID 0.0 0.1 0.2 ... 0.5 event_1 event_2 event_3 avg
0 4/20/21 4:20 302 0 NaN NaN ... NaN 0 2 NaN 0.00
1 2/17/21 9:20 135 1 1.4 1.8 ... 10.0 0 1 4.0 1.20
2 2/17/21 9:20 111 4 5.0 5.1 ... 5.4 3 4 5.0 5.25
[3 rows x 12 columns]
Average between event_1 and event_3:
df['avg'] = df.apply(getMeanValue,axis=1, eN1=1, eN2=3)
Date_Time CID 0.0 0.1 0.2 ... 0.5 event_1 event_2 event_3 avg
0 4/20/21 4:20 302 0 NaN NaN ... NaN 0 2 NaN NaN
1 2/17/21 9:20 135 1 1.4 1.8 ... 10.0 0 1 4.0 2.84
2 2/17/21 9:20 111 4 5.0 5.1 ... 5.4 3 4 5.0 5.30
[3 rows x 12 columns]
The format of your data is hard to work with. I would spend some time rearranging it into a less wide format, then do the work needed.
Here is a quick example, but I did not spend any time making this readable:
base = example_g_table.set_index(['Date_Time','CID']).stack().to_frame()
data = example_s_table.set_index(['Date_Time','CID']).stack().reset_index().set_index(['Date_Time','CID', 0])
base['events'] = data
base = base.reset_index()
base = base.rename(columns={'level_2': 'local_index', 0: 'values'})
This produces a frame that looks something like this:
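Roughly, reconstructed here from the example data above (take it as an illustration only; exact dtypes and formatting may differ):
        Date_Time  CID local_index  values   events
0    4/20/21 4:20  302         0.0     0.0  event_1
1    2/17/21 9:20  135         0.0     1.0  event_1
2    2/17/21 9:20  135         0.1     1.4  event_2
3    2/17/21 9:20  135         0.2     1.8      NaN
4    2/17/21 9:20  135         0.3     2.0      NaN
5    2/17/21 9:20  135         0.4     8.0  event_3
6    2/17/21 9:20  135         0.5    10.0      NaN
7    2/17/21 9:20  111         0.0     4.0      NaN
8    2/17/21 9:20  111         0.1     5.0      NaN
9    2/17/21 9:20  111         0.2     5.1      NaN
10   2/17/21 9:20  111         0.3     5.2  event_1
11   2/17/21 9:20  111         0.4     5.3  event_2
12   2/17/21 9:20  111         0.5     5.4  event_3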
In this format calculating the result is not so hard.
import numpy as np
from functools import partial

def mean_two_events(event1, event2, columns_to_mean, df):
    event_1 = df['events'] == event1
    event_2 = df['events'] == event2
    if any(event_1) and any(event_2):
        # slice from the first occurrence of event1 to the first occurrence of event2
        return df.loc[event_1.idxmax():event_2.idxmax()][columns_to_mean].mean()
    else:
        return np.nan

mean_event2_and_event3 = partial(mean_two_events, 'event_2', 'event_3', 'values')
mean_event1_and_event3 = partial(mean_two_events, 'event_1', 'event_3', 'values')

base.groupby(['Date_Time','CID']).apply(mean_event2_and_event3).reset_index()
Good luck!
Edit:
Here is an alternative solution that filters out the values BEFORE the groupby.
base['events'] = base.groupby(['Date_Time','CID']).events.ffill()
# This calculates all periods up until the next event. The shift makes the first value of the next event included as well.
# The problem with this approach is that more complex logic will be needed if you need to calculate values between events
# that are not adjacent, i.e. this won't work if you want to calculate between event_1 and event_3.
base['time_periods_to_include'] = ((base.events == 'event_2') | (base.groupby(['Date_Time','CID']).events.shift() == 'event_2'))
# Now we can simply do:
filtered_base = base[base['time_periods_to_include']]
filtered_base.groupby(['Date_Time','CID']).values.mean()
# The benefit is that you can now easily do:
filtered_base.groupby(['Date_Time','CID']).values.rolling(5).mean()
I'm trying to calculate a rolling average on some incomplete data. I want to average values in column 2 across windows of size 1.0 of the value in column 1 (miles). I've tried .rolling(), but (from my limited understanding) this only creates windows based on the index, and not on column values.
import pandas as pd
import numpy as np
df = pd.DataFrame([
[4.5, 10],
[4.6, 11],
[4.8, 9],
[5.5, 6],
[5.6, 6],
[8.1, 10],
[8.2, 13]
])
averages = []
for index in range(len(df)):
    nearby = df.loc[np.abs(df[0] - df.loc[index][0]) <= 0.5]
    averages.append(nearby[1].mean())
df['rollingAve'] = averages
Gives the desired output:
0 1 rollingAve
0 4.5 10 10.0
1 4.6 11 10.0
2 4.8 9 10.0
3 5.5 6 6.0
4 5.6 6 6.0
5 8.1 10 11.5
6 8.2 13 11.5
But this slows down substantially for big dataframes. Is there a way to implement .rolling() with varying window sizes, or something similar?
Pandas' BaseIndexer is quite handy, although it takes a little bit of head-scratching to get it right.
In the following, I use np.searchsorted to quickly find the indices (start, end) of each window:
from pandas.api.indexers import BaseIndexer

class RangeWindow(BaseIndexer):
    def __init__(self, val, width):
        self.val = val.values
        self.width = width

    def get_window_bounds(self, num_values, min_periods, center, closed):
        if min_periods is None: min_periods = 0
        if closed is None: closed = 'left'
        w = (-self.width/2, self.width/2) if center else (0, self.width)
        side0 = 'left' if closed in ['left', 'both'] else 'right'
        side1 = 'right' if closed in ['right', 'both'] else 'left'
        ix0 = np.searchsorted(self.val, self.val + w[0], side=side0)
        ix1 = np.searchsorted(self.val, self.val + w[1], side=side1)
        ix1 = np.maximum(ix1, ix0 + min_periods)
        return ix0, ix1
Some deluxe options: min_periods, center, and closed are implemented according to what DataFrame.rolling specifies.
Application:
df = pd.DataFrame([
[4.5, 10],
[4.6, 11],
[4.8, 9],
[5.5, 6],
[5.6, 6],
[8.1, 10],
[8.2, 13]
], columns='a b'.split())
df.b.rolling(RangeWindow(df.a, width=1.0), center=True, closed='both').mean()
# gives:
0 10.0
1 10.0
2 10.0
3 6.0
4 6.0
5 11.5
6 11.5
Name: b, dtype: float64
Timing:
df = pd.DataFrame(
np.random.uniform(0, 1000, size=(1_000_000, 2)),
columns='a b'.split(),
)
df = df.sort_values('a').reset_index(drop=True)
%%time
avg = df.b.rolling(RangeWindow(df.a, width=1.0)).mean()
CPU times: user 133 ms, sys: 3.58 ms, total: 136 ms
Wall time: 135 ms
Update on performance:
Following a comment from @anon01, I was wondering if one could go faster for the case when the rolling involves large windows. It turns out I should have measured Pandas' rolling mean and sum performance first... (premature optimization, anyone?) See the end for why.
Anyway, the idea was to do a cumsum just once, then take the difference of the elements dereferenced by the window endpoints:
# both functions below work on numpy arrays:
def fast_rolling_sum(a, b, width):
    z = np.concatenate(([0], np.cumsum(b)))
    ix0 = np.searchsorted(a, a - width/2, side='left')
    ix1 = np.searchsorted(a, a + width/2, side='right')
    return z[ix1] - z[ix0]

def fast_rolling_mean(a, b, width):
    z = np.concatenate(([0], np.cumsum(b)))
    ix0 = np.searchsorted(a, a - width/2, side='left')
    ix1 = np.searchsorted(a, a + width/2, side='right')
    return (z[ix1] - z[ix0]) / (ix1 - ix0)
With this (and the 1-million rows df above), I see:
%timeit fast_rolling_mean(df.a.values, df.b.values, width=100.0)
# 93.9 ms ± 335 µs per loop
versus:
%timeit df.rolling(RangeWindow(df.a, width=100.0), min_periods=1).mean()
# 248 ms ± 1.54 ms per loop
However!!! Pandas is likely already doing such an optimization (it's a pretty obvious one). The timings don't increase with larger windows (which is why I was saying I should have checked first).
df.rolling and series.rolling do allow for value-based windows if the index is of type DateTimeIndex or TimedeltaIndex. You can use this to get close to the desired result:
df = df.set_index(pd.TimedeltaIndex(df[0]*1e9))
df["rolling_mean"] = df[1].rolling("1s").mean()
df = df.reset_index(drop=True)
output:
0 1 rolling_mean
0 4.5 10 10.000000
1 4.6 11 10.500000
2 4.8 9 10.000000
3 5.5 6 8.666667
4 5.6 6 7.000000
5 8.1 10 10.000000
6 8.2 13 11.500000
Advantages
This is a three-line solution that should have great performance, leveraging pandas' datetime backend.
Disadvantages
This is definitely a hack, casting your miles column to time-delta seconds, and the average isn't centered (center isn't implemented for datetime-like and offset-based windows).
Overall: if you value performance and can live with a non-centered mean, this would be a great way to go with a comment or two.
I'm trying to calculate a rolling statistic that requires all variables in a window from two input columns.
My only solution involves a for loop. Is there a more efficient way, perhaps using Pandas' rolling and apply functions?
import pandas as pd
from statsmodels.tsa.stattools import coint
def f(x):
    return coint(x['a'], x['b'])[1]
df = pd.DataFrame(data={'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})
df2 = df.rolling(2).apply(lambda x: f(x), raw=False) # KeyError: 'a'
I get KeyError: 'a' because df gets passed to f() one series (column) at a time. Specifying axis=1 sends one row and all columns to f(), but neither approach provides the required set of observations.
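For reference, the for-loop fallback mentioned at the top might look roughly like this (a sketch only; a window of 2 rows and the f above are assumed, and coint may still complain on very small or collinear windows):
# sketch: slice each window of 2 consecutive rows manually and call f on it
results = [f(df.iloc[i - 1:i + 1]) for i in range(1, len(df))]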
You could try rolling, mean and sum:
df['result'] = df.rolling(2).mean().sum(axis=1)
a b result
0 1 5 0.0
1 2 6 7.0
2 3 7 9.0
3 4 8 11.0
EDIT
Adding a different answer based upon new information in the question by OP.
Set up the function.
import pandas as pd
from statsmodels.tsa.stattools import coint
def f(x):
    return coint(x['a'], x['b'])
Create the data and dataframe:
a_data = [1,2,3,4]
b_data = [5,6,7,8]
df = pd.DataFrame(data={'a': a_data, 'b': b_data})
a b
0 1 5
1 2 6
2 3 7
3 4 8
After researching coint, I gather that you are trying to pass two rolling arrays in as x['a'] and x['b']. The following will create the arrays and the dataframe.
n=2
arr_a = [df['a'].shift(x).values[::-1][:n] for x in range(len(df['a']))[::-1]]
arr_b = [df['b'].shift(x).values[::-1][:n] for x in range(len(df['b']))[::-1]]
df1 = pd.DataFrame(data={'a': arr_a, 'b': arr_b})
n is the size of the rolling window.
df1
a b
0 [1.0, nan] [5.0, nan]
1 [2.0, 1.0] [6.0, 5.0]
2 [3.0, 2.0] [7.0, 6.0]
3 [4, 3] [8, 7]
Then you can use .apply(f) to send in the rows of arrays.
df1.iloc[(n-1):,].apply(f, axis=1)
Your output is as follows:
1 (-inf, 0.0, [-48.37534, -16.26923, -10.00565])
2 (-inf, 0.0, [-48.37534, -16.26923, -10.00565])
3 (-inf, 0.0, [-48.37534, -16.26923, -10.00565])
dtype: object
When I run this I do get an error for perfectly collinear data, but I suspect that will disappear with real data.
Also, I know a purely vectorized solution might have been faster. I wonder what the performance will be like for this, if it is what you are looking for?
Hats off to @Zero, who really had the solution for this problem here.
I tried placing the sum before the rolling:
import pandas as pd
import time
df = pd.DataFrame(data={'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})
df2 = df.copy()
s = time.time()
df2.loc[:, 'mean1'] = df.sum(axis = 1).rolling(2).mean()
print(time.time() - s)
s = time.time()
df2.loc[:, 'mean2'] = df.rolling(2).mean().sum(axis=1)
print(time.time() - s)
df2
0.003737926483154297
0.005460023880004883
a b mean1 mean2
0 1 5 NaN 0.0
1 2 6 7.0 7.0
2 3 7 9.0 9.0
3 4 8 11.0 11.0
It is slightly faster than the previous answer, but works the same, and in large datasets the difference might be significant.
You can modify it to select the columns of interest only:
s = time.time()
print(df[['a', 'b']].sum(axis = 1).rolling(2).mean())
print(time.time() - s)
0 NaN
1 7.0
2 9.0
3 11.0
dtype: float64
0.0033559799194335938
Derived from another question, here
I have a 2-million-row DataFrame, something similar to this:
final_df = pd.DataFrame.from_dict({
'ts': [0,1,2,3,4,5],
'speed': [5,4,1,4,1,4],
'temp': [9,8,7,8,7,8],
'temp2': [2,2,7,2,7,2],
})
I need to run calculations with the values on each row and append the results as new columns, something similar to the question in this link.
I know that there are a lot of combinations of speed, temp, and temp2 that are repeated; if I drop_duplicates, the resulting DataFrame is only 50k rows long, which takes significantly less time to process using an apply function like this:
def dafunc(row):
    row['r1'] = row['speed'] * row['temp'] * k1
    row['r2'] = row['speed'] * row['temp2'] * k2
    return row

nodup_df = final_df.drop_duplicates(['speed','temp','temp2'])
nodup_df = nodup_df.apply(dafunc, axis=1)
The above code is a greatly simplified version of what I actually do.
So far I'm trying to use a dictionary where I store the results and a string formed of the combinations is the key, if the dictionary already has those results, I get them instead of making the calculations again.
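A minimal sketch of that caching idea (the function name and constants here are only illustrative):
k1, k2 = 0.5, 1  # illustrative constants, matching the expected output below
cache = {}

def cached_calc(row):
    key = (row['speed'], row['temp'], row['temp2'])
    if key not in cache:  # compute each unique combination only once
        cache[key] = (row['speed'] * row['temp'] * k1,
                      row['speed'] * row['temp2'] * k2)
    return cache[key]

final_df[['r1', 'r2']] = final_df.apply(cached_calc, axis=1, result_type='expand')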
Is there a more efficient way to do this using Pandas' vectorized operations?
EDIT:
In the end, the resulting DataFrame should look like this:
#assuming k1 = 0.5, k2 = 1
resulting_df = pd.DataFrame.from_dict({
'ts': [0,1,2,3,4,5],
'speed': [5,4,1,4,1,4],
'temp': [9,8,7,8,7,8],
'temp2': [2,2,7,2,7,2],
'r1': [22.5,16,3.5,16,3.5,16],
'r2': [10,8,7,8,7,8],
})
Well, if you access the columns from the underlying numpy array based on the column index, it will be a lot faster, i.e.
final_df['r1'] = final_df.values[:,0]*final_df.values[:,1]*k1
final_df['r2'] = final_df.values[:,0]*final_df.values[:,2]*k2
If you want to create multiple columns at once, you can use a for loop, and the speed will be similar:
k = [0.5,1]
for i in range(1,3):
    final_df['r'+str(i)] = final_df.values[:,0]*final_df.values[:,i]*k[i-1]
If you drop duplicates it will be much faster (a sketch of that idea follows after the output below).
Output:
speed temp temp2 ts r1 r2
0 5 9 2 0 22.5 10.0
1 4 8 2 1 16.0 8.0
2 1 7 7 2 3.5 7.0
3 4 8 2 3 16.0 8.0
4 1 7 7 4 3.5 7.0
5 4 8 2 5 16.0 8.0
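Regarding the drop-duplicates remark above, a sketch of computing once per unique combination and then merging the results back might look like this (an illustration, not the original answer's code; it assumes k1 and k2 are defined and that final_df does not yet have r1/r2 columns):
# compute r1/r2 only once per unique (speed, temp, temp2) combination
nodup = final_df[['speed', 'temp', 'temp2']].drop_duplicates().copy()
nodup['r1'] = nodup['speed'] * nodup['temp'] * k1
nodup['r2'] = nodup['speed'] * nodup['temp2'] * k2
# broadcast the per-combination results back onto every original row
final_df = final_df.merge(nodup, on=['speed', 'temp', 'temp2'], how='left')
This way the element-wise work is done only on the unique combinations (about 50k rows in the question) rather than on all 2 million rows.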
For a small dataframe
%%timeit
final_df['r1'] = final_df.values[:,0]*final_df.values[:,1]*k1
final_df['r2'] = final_df.values[:,0]*final_df.values[:,2]*k2
1000 loops, best of 3: 708 µs per loop
For a large dataframe
%%timeit
ndf = pd.concat([final_df]*10000)
ndf['r1'] = ndf.values[:,0]*ndf.values[:,1]*k1
ndf['r2'] = ndf.values[:,0]*ndf.values[:,2]*k2
1 loop, best of 3: 6.19 ms per loop
Let's say I have a pandas.DataFrame that looks as follows:
c1 | c2
-------
1 | 5
2 | 6
3 | 7
4 | 8
.....
1 | 7
and I'm looking to map a function (DataFrame.corr) over it, but I would like it to take n rows at a time. The result should be a series of correlation values that is either shorter than the original DataFrame or has a few values that didn't get a full n rows of data.
Is there a way to do this and how? I've been looking through the DataFrame and Map, Apply, Filter documentation but it doesn't seem to have an obvious or clean solution.
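For concreteness, a naive loop over n-row windows (a sketch of the intended behaviour, assuming columns named c1 and c2 and n = 5, not a recommended implementation) would be:
n = 5
rolling_corr = pd.Series(
    [df['c1'].iloc[i - n + 1:i + 1].corr(df['c2'].iloc[i - n + 1:i + 1])
     for i in range(n - 1, len(df))],
    index=df.index[n - 1:],
)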
With pandas 0.20, using rolling with corr produces a multi-indexed dataframe. You can slice afterwards to get at what you're looking for.
Consider the dataframe df
np.random.seed([3,1415])
df = pd.DataFrame(np.random.randint(10, size=(10, 2)), columns=['c1', 'c2'])
c1 c2
0 0 2
1 7 3
2 8 7
3 0 6
4 8 6
5 0 2
6 0 4
7 9 7
8 3 2
9 4 3
rolling + corr... pandas 0.20.x
df.rolling(5).corr().dropna().c1.xs('c2', level=1)
# Or equivalently
# df.rolling(5).corr().stack().xs(['c1', 'c2'], level=[1, 2])
4 0.399056
5 0.399056
6 0.684653
7 0.696074
8 0.841136
9 0.762187
Name: c1, dtype: float64
rolling + corr... pandas 0.19.x or prior
Prior to 0.20, rolling + corr produced a pd.Panel
df.rolling(5).corr().loc[:, 'c1', 'c2'].dropna()
4 0.399056
5 0.399056
6 0.684653
7 0.696074
8 0.841136
9 0.762187
Name: c2, dtype: float64
numpy + as_strided
However, I wasn't satisfied with the above approaches. Below is a specialized function that takes an n x 2 dataframe and returns a series of the rolling correlations. DISCLAIMER: This uses some advanced techniques and should really only be used if you know what it does. Meaning, if you need a detailed breakdown of how this works... then it probably isn't for you.
from numpy.lib.stride_tricks import as_strided as strided

def rolling_correlation(a, w):
    n, m = a.shape[0], 2
    s1, s2 = a.strides
    b = strided(a, (m, w, n - w + 1), (s2, s1, s1))
    b_mb = b - b.mean(1, keepdims=True)
    b_ss = (b_mb ** 2).sum(1) ** .5
    return (b_mb[0] * b_mb[1]).sum(0) / (b_ss[0] * b_ss[1])

def rolling_correlation_df(df, w):
    a = df.values
    return pd.Series(rolling_correlation(a, w), df.index[w-1:])
rolling_correlation_df(df, 5)
4 0.399056
5 0.399056
6 0.684653
7 0.696074
8 0.841136
9 0.762187
dtype: float64
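As a side note (not part of the original answer): on NumPy 1.20+ you can build the same windows with sliding_window_view instead of hand-written strides; a sketch of an equivalent function (pd and np imported as above):
from numpy.lib.stride_tricks import sliding_window_view

def rolling_correlation_swv(df, w):
    # one (2, w) window per output row: shape (n - w + 1, 2, w)
    windows = sliding_window_view(df.values, w, axis=0)
    a, b = windows[:, 0, :], windows[:, 1, :]
    am = a - a.mean(axis=1, keepdims=True)
    bm = b - b.mean(axis=1, keepdims=True)
    corr = (am * bm).sum(axis=1) / np.sqrt((am ** 2).sum(axis=1) * (bm ** 2).sum(axis=1))
    return pd.Series(corr, index=df.index[w - 1:])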
Timing
small data
%timeit rolling_correlation_df(df, 5)
10000 loops, best of 3: 79.9 µs per loop
%timeit df.rolling(5).corr().stack().xs(['c1', 'c2'], level=[1, 2])
100 loops, best of 3: 14.6 ms per loop
large data
np.random.seed([3,1415])
df = pd.DataFrame(np.random.randint(10, size=(10000, 2)), columns=['c1', 'c2'])
%timeit rolling_correlation_df(df, 5)
1000 loops, best of 3: 615 µs per loop
%timeit df.rolling(5).corr().stack().xs(['c1', 'c2'], level=[1, 2])
1 loop, best of 3: 1.98 s per loop