median in pandas dropping center value - python

I am working in pandas and want to implement an algorithm that requires I assess a modified centered median on a window, omitting the middle value. So, for instance, the unmodified version might be:
ser = pd.Series(data=[0.,1.,2.,4.5,5.,6.,8.,9])
med = ser.rolling(5,center=True).median()
print(med)
and I would like the result for med[3] to be 3.5 (the median of 1., 2., 5., 6.) rather than 4.5, which is the ordinary windowed median. Is there an economical way to do this?

Try:
import numpy as np
import pandas as pd
ser = pd.Series(data=[0.,1.,2.,4.5,5.,6.,8.,9])
med = ser.rolling(5).apply(lambda x: np.median(np.concatenate([x[0:2],x[3:5]]))).shift(-2)
print(med)
With output:
0 NaN
1 NaN
2 2.75
3 3.50
4 5.25
5 6.50
6 NaN
7 NaN
And more generally:
rolling_size = 5
ser.rolling(rolling_size).apply(lambda x: np.median(np.concatenate([x[0:int(rolling_size/2)],x[int(rolling_size/2)+1:rolling_size]]))).shift(-int(rolling_size/2))
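If you prefer to keep center=True instead of shifting, a roughly equivalent sketch (assuming an odd window size; the helper name median_without_center is mine, not part of pandas) is:
import numpy as np
import pandas as pd
ser = pd.Series(data=[0., 1., 2., 4.5, 5., 6., 8., 9.])
rolling_size = 5
def median_without_center(window):
    # window is one rolling slice of length rolling_size; drop its middle element
    values = window.to_numpy()
    return np.median(np.delete(values, len(values) // 2))
med = ser.rolling(rolling_size, center=True).apply(median_without_center, raw=False)
print(med)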

ser = pd.Series(data=[0.,1.,2.,4.5,5.,6.,8.,9])
def median(series, window=2):
    df = pd.DataFrame(series[window:].reset_index(drop=True))
    df[1] = series[:-window]
    df = df.apply(lambda x: x.mean(), axis=1)
    df.index += window - 1
    return df
median(ser)
I think this is simpler.

Related

Pandas pct_change with moving average

I would like to use pandas' pct_change to compute the rate of change between each value and the previous rolling average (before that value). Here is what I mean:
If I have:
import pandas as pd
df = pd.DataFrame({'data': [1, 2, 3, 7]})
I would expect to get, for window size of 2:
0 NaN
1 NaN
2 1
3 1.8
because roc(3, avg(1, 2)) = (3 - 1.5) / 1.5 = 1, and the same calculation gives 1.8. Using pct_change with the periods parameter just skips the previous n entries; it doesn't do the job.
Any ideas on how to do this in an elegant pandas way for any window size?
Here is one way to do it, using rolling and shift:
df['avg']=df.rolling(2).mean()
df['poc'] = (df['data'] - df['avg'].shift(+1))/ df['avg'].shift(+1)
df.drop(columns='avg')
data poc
0 1 NaN
1 2 NaN
2 3 1.0
3 7 1.8
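For what it's worth, the same numbers can be produced without the helper column; this is just a compact rewrite of the calculation above, not a different method:
# value divided by the previous rolling mean, minus 1
df['poc'] = df['data'] / df['data'].rolling(2).mean().shift(1) - 1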

Improve efficiency of selecting values from dataframe by index

I have a simulation that uses pandas Dataframes to describe objects in a hierarchy. To achieve this, I have used a MultiIndex to show the route to a child object.
Parent df
par_val
a b
0 0.0 0.366660
1.0 0.613888
1 2.0 0.506531
3.0 0.327356
2 4.0 0.684335
0.0 0.013800
3 1.0 0.590058
2.0 0.179399
4 3.0 0.790628
4.0 0.310662
Child df
child_val
a b c
0 0.0 0 0.528217
1.0 0 0.515479
1 2.0 0 0.719221
3.0 0 0.785008
2 4.0 0 0.249344
0.0 0 0.455133
3 1.0 0 0.009394
2.0 0 0.775960
4 3.0 0 0.639091
4.0 0 0.150854
0 0.0 1 0.319277
1.0 1 0.571580
1 2.0 1 0.029063
3.0 1 0.498197
2 4.0 1 0.424188
0.0 1 0.572045
3 1.0 1 0.246166
2.0 1 0.888984
4 3.0 1 0.818633
4.0 1 0.366697
This implies that objects (0,0,0) and (0,0,1) in the child Dataframe are both characterised by the values at (0,0) in the parent Dataframe.
When a function is performed on the child dataframe for a certain subset of 'a', it may therefore need to grab the corresponding value from the parent. My current solution locates the value in the parent Dataframe by index within the solution function:
import pandas as pd
import numpy as np
import time
from matplotlib import pyplot as plt
r = range(10, 1000, 10)
dt = []
for i in r:
    start = time.time()
    df_par = pd.DataFrame(
        {'a': np.repeat(np.arange(5), i // 5),  # integer repeat count
         'b': np.append(np.arange(i / 2), np.arange(i / 2)),
         'par_val': np.random.rand(i)
         }).set_index(['a', 'b'])
    df_child = pd.concat([df_par[[]]] * 2, keys=[0, 1], names=['c'])\
        .reorder_levels(['a', 'b', 'c'])
    df_child['child_val'] = np.random.rand(i * 2)
    df_child['solution'] = np.nan

    def solution(row, df_par, var):
        data_level = len(df_par.index.names)
        index_filt = tuple([row.name[i] for i in range(data_level)])
        sol = df_par.loc[index_filt, 'par_val'] / row.child_val
        return sol

    a_mask = df_child.index.get_level_values('a') == 0
    df_child.loc[a_mask, 'solution'] = df_child.loc[a_mask].apply(solution,
                                                                  df_par=df_par,
                                                                  var=10,
                                                                  axis=1)
    stop = time.time()
    dt.append(stop - start)
plt.plot(r, dt)
plt.show()
The solution function is becoming very costly for large numbers of iterations in the simulation:
(plot of iterations on x vs time in seconds on y)
Is there a more efficient method of calculating this? I have considered including 'par_val' in the child df, but I was trying to avoid this, as the very large number of repetitions reduces the number of simulations I can fit in RAM.
par_val is a float64, which takes 8 bytes per value. If the child data frame has 1 million rows, that's 8 MB of memory (before the OS's memory compression kicks in). If it has 1 billion rows, then yes, I would worry about the memory impact.
The bigger performance bottleneck, though, is your df_child.loc[a_mask].apply(..., axis=1) line. This makes pandas use a slow Python loop instead of much faster vectorized code. In SQL, the loop approach is called row-by-agonizing-row, and it's an anti-pattern. You generally want to avoid .apply(..., axis=1) for this reason.
Here's one way to improve the performance without changing df_par or df_child:
a_mask = df_child.index.get_level_values('a') == 0
child_val = df_child.loc[a_mask, 'child_val'].droplevel(-1)
solution = df_par.loc[child_val.index, 'par_val'] / child_val
df_child.loc[a_mask, 'solution'] = solution.to_numpy()
(Timing plots before and after the change omitted.)
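If you later decide that the memory cost of carrying par_val in the child frame is acceptable (see the 8 MB estimate above), a minimal sketch of that alternative, assuming a pandas version that supports joining on overlapping index level names, could look like:
# Broadcast par_val onto the child index by joining on the shared 'a' and 'b' levels,
# after which the division is an ordinary vectorized column operation.
df_child = df_child.join(df_par[['par_val']])
a_mask = df_child.index.get_level_values('a') == 0
df_child.loc[a_mask, 'solution'] = (
    df_child.loc[a_mask, 'par_val'] / df_child.loc[a_mask, 'child_val']
)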

How to speed up conditional statement in python

I am trying to generate a new column in a pandas dataframe by looping over >100,000 rows and setting each row's value conditional on values that already exist.
The current dataframe is a dummy but works as an example. My current code is:
df=pd.DataFrame({'IT100':[5,5,-0.001371,0.0002095,-5,0,-5,5,5],
'ET110':[0.008187884,0.008285232,0.00838258,0.008479928,1,1,1,1,1]})
# if charging set to 1, if discharging set to -1.
# if -1 < IT100 < 1 then set CD to the previous cell's value
# Charging is defined as IT100 > 1 and Discharge is defined as IT100 < -1
def CD(dataFrame):
    for x in range(0, len(dataFrame.index)):
        current = dataFrame.loc[x, "IT100"]
        if x == 0:
            if dataFrame.loc[x+5, "IT100"] > -1:
                dataFrame.loc[x, "CD"] = 1
            else:
                dataFrame.loc[x, "CD"] = -1
        else:
            if current > 1:
                dataFrame.loc[x, "CD"] = 1
            elif current < -1:
                dataFrame.loc[x, "CD"] = -1
            else:
                dataFrame.loc[x, "CD"] = dataFrame.loc[x-1, "CD"]
Using if/Else loops is extremely slow. I see that people have suggested to use np.select() or pd.apply(), but I do not know if this will work for my example. I need to be able to index the column because one of my conditions is to set the value of the new column to the value of the previous cell in the column of interest.
Thanks for any help!
@Grajdeanu Alex is right: the loop is slowing you down more than whatever you're doing inside it. With pandas, a loop is usually the slowest choice. Try this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'IT100':[0,-50,-20,-0.5,-0.25,-0.5,-10,5,0.5]})
df['CD'] = np.nan
#lower saturation
df.loc[df['IT100'] < -1,['CD']] = -1
#upper saturation
df.loc[df['IT100'] > 1,['CD']] = 1
#fill forward
df['CD'] = df['CD'].ffill()
# setting the first row equal to the fifth
df.loc[0,['CD']] = df.loc[5,['CD']]
Using ffill will use the last valid value to fill in subsequent NaN values (-1 < x < 1).
Similar to EMiller's answer, you could also use clip.
import pandas as pd
import numpy as np
df = pd.DataFrame({'IT100':[0,-50,-20,-0.5,-0.25,-0.5,-10,5,0.5]})
df['CD'] = df['IT100'].clip(-1, 1)
df.loc[~df['CD'].isin([-1, 1]), 'CD'] = np.nan
df['CD'] = df['CD'].ffill()
df.loc[0,['CD']] = df.loc[5,['CD']]
As an alternative to @EMiller's answer:
In [213]: df = pd.DataFrame({'IT100':[0,-50,-20,-0.5,-0.25,-0.5,-10,5,0.5]})
In [214]: df
Out[214]:
IT100
0 0.00
1 -50.00
2 -20.00
3 -0.50
4 -0.25
5 -0.50
6 -10.00
7 5.00
8 0.50
In [215]: df['CD'] = pd.Series(np.where(df['IT100'].between(-1, 1), np.nan, df['IT100'].clip(-1, 1))).ffill()
In [217]: df.loc[0, 'CD'] = 1 if df.loc[5, 'IT100'] > -1 else -1
In [218]: df
Out[218]:
IT100 CD
0 0.00 1.0
1 -50.00 -1.0
2 -20.00 -1.0
3 -0.50 -1.0
4 -0.25 -1.0
5 -0.50 -1.0
6 -10.00 -1.0
7 5.00 1.0
8 0.50 1.0
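Since the question also mentions np.select, here is a minimal sketch of that route; it is the same idea as the answers above (classify, then forward-fill, then fix the first row):
import numpy as np
import pandas as pd
df = pd.DataFrame({'IT100': [0, -50, -20, -0.5, -0.25, -0.5, -10, 5, 0.5]})
# Charging -> 1, discharging -> -1, dead band (-1 < IT100 < 1) -> NaN for now
df['CD'] = np.select([df['IT100'] > 1, df['IT100'] < -1], [1, -1], default=np.nan)
# Carry the last charging/discharging state through the dead band
df['CD'] = df['CD'].ffill()
# Same special case for the first row as in the original function
df.loc[0, 'CD'] = 1 if df.loc[5, 'IT100'] > -1 else -1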

Rolling average all values of pandas DataFrame

I have a pandas DataFrame and I want to calculate, on a rolling basis, the average of all the values: across all the columns, for all the observations in the rolling window.
I have a solution with loops, but it feels very inefficient. Note that I can have NaNs in my data, so calculating the sum and dividing by the shape of the window would not be safe (as I want a nanmean).
Any better approach?
Setup
import numpy as np
import pandas as pd
np.random.seed(1)
df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)), columns=['A', 'B'])
df[df>5] = np.nan # EDIT: add nans
My Attempt
n_roll = 2
df_stacked = df.values
roll_avg = {}
for idx in range(n_roll, len(df_stacked) + 1):
    roll_avg[idx-1] = np.nanmean(df_stacked[idx - n_roll:idx, :].flatten())
roll_avg = pd.Series(roll_avg)
roll_avg.index = df.index[n_roll-1:]
roll_avg = roll_avg.reindex(df.index)
Desired Result
roll_avg
Out[33]:
0 NaN
1 5.000000
2 1.666667
3 0.333333
4 1.000000
5 3.000000
6 3.250000
7 3.250000
8 3.333333
9 4.000000
Thanks!
Here's one NumPy solution with sliding windows off view_as_windows -
from skimage.util.shape import view_as_windows
# Setup o/p array
out = np.full(len(df),np.nan)
# Get sliding windows of length n_roll along axis=0
w = view_as_windows(df.values,(n_roll,1))[...,0]
# Assign nan-ignored mean values computed along last 2 axes into o/p
out[n_roll-1:] = np.nanmean(w, (1,2))
Memory efficiency with views -
In [62]: np.shares_memory(df,w)
Out[62]: True
To get the same result in the presence of NaN, you can column_stack all the df.shift(i).values for i in range(n_roll), take the nanmean along axis=1, and then replace the first n_roll-1 values with NaN afterwards:
roll_avg = pd.Series(np.nanmean(np.column_stack([df.shift(i).values for i in range(n_roll)]),1))
roll_avg[:n_roll-1] = np.nan
With the NaN-containing input, you get the expected result:
0 NaN
1 5.000000
2 1.666667
3 0.333333
4 1.000000
5 3.000000
6 3.250000
7 3.250000
8 3.333333
9 4.000000
dtype: float64
Using the answer referenced in the comment, one can do:
wsize = n_roll
cols = df.shape[1]
out = df.stack(dropna=False).rolling(window=wsize * cols, min_periods=1).mean().reset_index(-1, drop=True).sort_index()
out = out.groupby(out.index).last()
out.iloc[:n_roll-1] = np.nan
In my case it was important to specify dropna=False in stack, otherwise the length of the rolling window would not be correct.
But I am looking forward to other approaches as this does not feel very elegant/efficient.
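One more NaN-aware possibility, sketched under the same setup (df and n_roll as above): take rolling sums of the per-row totals and of the per-row non-NaN counts, then divide, which reproduces a nanmean over every value in the window:
# Row-wise totals (NaNs skipped, i.e. treated as 0) and counts of non-NaN values
row_sum = df.sum(axis=1)
row_cnt = df.notna().sum(axis=1)
# Rolling nanmean over all columns = rolling total / rolling count
roll_avg = row_sum.rolling(n_roll).sum() / row_cnt.rolling(n_roll).sum()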

Applying a function to a pandas col

I would like to map the function GetPermittedFAR to my dataframe (df) so that, by testing whether a value in the column zonedist1 equals a certain value, I can build new columns such as df['FAR_Permitted'], etc.
I have tried various uses of map() etc. but haven't gotten this to work. I feel this should be a pretty simple thing to do?
Ideally, I would use a simple list comprehension / lambda, as I have many of these conditional tests that produce column data.
import pandas as pd
import numpy as np
def GetPermittedFAR():
    if df['zonedist1'] == 'R7-3':
        df['FAR_Permitted'] = 0.5
        df['Building Height Max'] = 35
    if df['zonedist1'] == 'R3-2':
        df['FAR_Permitted'] = 0.5
        df['Building Height Max'] = 35
    if df['zonedist1'] == 'R1-1':
        df['FAR_Permitted'] = 0.7
        df['Building Height Max'] = 100
    # etc... an if statement for each unique value in 'zonedist1'

df = pd.DataFrame({'zonedist1': ['R7-3', 'R3-2', 'R1-1',
                                 'R1-2', 'R2', 'R2A', 'R2X',
                                 'R1-1', 'R7-3', 'R3-2', 'R7-3',
                                 'R3-2', 'R1-1', 'R1-2'
                                 ]})
df = df.apply(lambda x: GetPermittedFAR(), axis=1)
How about using pd.merge()?
Let df be your dataframe
In [612]: df
Out[612]:
zonedist1
0 R7-3
1 R3-2
2 R1-1
3 R1-2
4 R2
5 R2A
6 R2X
and let merge be another dataframe with the conditions:
In [613]: merge
Out[613]:
zonedist1 FAR_Permitted Building Height Max
0 R7-3 0.5 35
1 R3-2 0.5 35
Then merge df with merge using how='left':
In [614]: df.merge(merge, how='left')
Out[614]:
zonedist1 FAR_Permitted Building Height Max
0 R7-3 0.5 35
1 R3-2 0.5 35
2 R1-1 NaN NaN
3 R1-2 NaN NaN
4 R2 NaN NaN
5 R2A NaN NaN
6 R2X NaN NaN
Later you can replace NaN values.
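A closely related sketch, if you would rather keep the conditions in a plain dict and use map instead of merge; the numbers below are the ones from the question's function, and the lookup table is only illustrative:
import pandas as pd
# zone -> (FAR_Permitted, Building Height Max), values taken from the question
rules = {
    'R7-3': (0.5, 35),
    'R3-2': (0.5, 35),
    'R1-1': (0.7, 100),
}
lookup = pd.DataFrame.from_dict(rules, orient='index',
                                columns=['FAR_Permitted', 'Building Height Max'])
df['FAR_Permitted'] = df['zonedist1'].map(lookup['FAR_Permitted'])
df['Building Height Max'] = df['zonedist1'].map(lookup['Building Height Max'])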
