Add a state column when another column is increasing/decreasing - python

I would like to add a column to a DataFrame indicating whether another column is increasing/decreasing or staying the same, with:
1 -> increasing, 0 -> same, -1 -> decreasing
So if df['battery'] = [1,2,3,4,7,9,3,3,3]
I would like the state to be df['state'] = [1,1,1,1,1,-1,0,0] (one element shorter, since the first value has no predecessor).

This should do the trick!
a = [1, 2, 3, 4, 7, 9, 3, 3, 3]
b = []
for x in range(len(a) - 1):
    # (x > y) - (x < y) is the classic sign-comparison trick: 1, 0 or -1
    b.append((a[x + 1] > a[x]) - (a[x + 1] < a[x]))
print(b)  # [1, 1, 1, 1, 1, -1, 0, 0]
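If you then want the result back in a DataFrame, one option is to pad the list to the original length first (a sketch; using None for the first row is an assumption about how you want to handle it):
import pandas as pd

df = pd.DataFrame({'battery': a})
df['state'] = [None] + b  # b is one element shorter than the column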

You could use the pd.Series.diff method to get the difference between consecutive values, and then assign the necessary state values using boolean indexing:
import pandas as pd
df = pd.DataFrame()
df['battery'] = [1,2,3,4,7,9,3,3,3]
diff = df['battery'].diff()
df.loc[diff > 0, 'state'] = 1
df.loc[diff == 0, 'state'] = 0
df.loc[diff < 0, 'state'] = -1
print(df)
#    battery  state
# 0        1    NaN
# 1        2    1.0
# 2        3    1.0
# 3        4    1.0
# 4        7    1.0
# 5        9    1.0
# 6        3   -1.0
# 7        3    0.0
# 8        3    0.0
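If you'd rather keep integer values alongside the missing first entry, pandas' nullable integer dtype can hold both (a usage note; requires pandas >= 0.24):
df['state'] = df['state'].astype('Int64')
print(df['state'].tolist())  # [<NA>, 1, 1, 1, 1, 1, -1, 0, 0]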
Or, alternatively, one could use np.select:
import numpy as np
diff = df['battery'].diff()
df['state'] = np.select([diff < 0, diff > 0], [-1, 1], 0)
# Be careful, default 0 will replace the first NaN as well.
print(df)
#    battery  state
# 0        1      0
# 1        2      1
# 2        3      1
# 3        4      1
# 4        7      1
# 5        9      1
# 6        3     -1
# 7        3      0
# 8        3      0

So here's your dataframe:
>>> import pandas as pd
>>> data = [[[1,2,3,4,7,9,3,3,3]]]
>>> df = pd.DataFrame(data, columns=['battery'])
>>> df
battery
0 [1, 2, 3, 4, 7, 9, 3, 3, 3]
And finally use apply and a lambda function in order to generate the required result:
>>> df['state'] = df.apply(lambda row: [1 if t - s > 0 else -1 if t-s < 0 else 0 for s, t in zip(row['battery'], row['battery'][1:])], axis=1)
>>> df
battery state
0 [1, 2, 3, 4, 7, 9, 3, 3, 3] [1, 1, 1, 1, 1, -1, 0, 0]
Alternatively, if you want the exact difference between each element in the list, you can use the following:
>>> df['state'] = df.apply(lambda row: [t - s for s, t in zip(row['battery'], row['battery'][1:])], axis=1)
>>> df
battery state
0 [1, 2, 3, 4, 7, 9, 3, 3, 3] [1, 1, 1, 3, 2, -6, 0, 0]

Try np.sign (older answers use the pd.np alias, which has been removed from recent pandas):
import numpy as np
np.sign(df.battery.diff().fillna(1))
0 1.0
1 1.0
2 1.0
3 1.0
4 1.0
5 1.0
6 -1.0
7 0.0
8 0.0
Name: battery, dtype: float64
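The same idea works with the plain numpy namespace; casting to int afterwards gives a cleaner column (a sketch; fillna(1) treating the undefined first row as increasing is carried over from the answer above):
import numpy as np

df['state'] = np.sign(df['battery'].diff().fillna(1)).astype(int)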

Related

Pandas: find interval distance from N consecutive to M consecutive

TLDR version:
I have a column like below,
[2, 2, 0, 0, 0, 2, 2, 0, 3, 3, 3, 0, 0, 2, 2, 0, 0, 0, 0, 2, 2, 0, 0, 0, 3, 3, 3]
# Other run lengths, like 4, 5, 6, 7, 8..., may occur as well.
I need a function with parameters n, m; if I use
n=2, m=3,
I will get the distance between each 2-run and the following 3-run, so the final result after grouping would be:
[6, 9]
Detailed version
Here is the test case. I'm writing a function that, given n and m, generates a list of distances between consecutive runs. Currently, this function only works with one parameter N (the distance from one N-run to the next N-run). I want to change it so that it also accepts M.
dummy = [1,1,0,0,0,1,1,0,1,1,1,0,0,1,1,0,0,0,0,1,1,0,0,0,1,1,1]
df = pd.DataFrame({'a': dummy})
What I have currently:
def get_N_seq_stat(df, N=2, M=3):
    df["c1"] = (
        df.groupby(df.a.ne(df.a.shift()).cumsum())["a"]
        .transform("size")
        .where(df.a.eq(1), 0)
    )
    df["c2"] = np.where(df.c1.ne(N), 1, 0)
    df["c3"] = df["c2"].ne(df["c2"].shift()).cumsum()
    result = df.loc[df["c2"] == 1].groupby("c3")["c2"].count().tolist()
    # if the last/first N rows are not an N-run, don't count the trailing/leading distance
    if not (df["c1"].tail(N) == N).all():
        del result[-1]
    if not (df["c1"].head(N) == N).all():
        del result[0]
    return result
If I set N=2, M=3 (from a 2-run to a 3-run), the ideal return value would be [6, 9], because:
dummy = [1,1,**0,0,0,1,1,0,**1,1,1,0,0,1,1,**0,0,0,0,1,1,0,0,0,**1,1,1]
Currently, if I set N=2, the returned list is [3, 6, 4], because:
dummy = [1,1,**0,0,0,**1,1,**0,1,1,1,0,0,**1,1,**0,0,0,0,**1,1,0,0,0,1,1,1]
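For reference, the run-length labelling trick that both the function above and the answer below rely on works like this: ne(shift()) marks where the value changes and cumsum() gives every run its own group id (a minimal sketch):
import pandas as pd

s = pd.Series([1, 1, 0, 0, 0, 1, 1])
# each run of equal values gets its own label
print(s.ne(s.shift()).cumsum().tolist())  # [1, 1, 2, 2, 2, 3, 3]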
I would modify your code this way:
def get_N_seq_stat(df, N=2, M=3, debug=False):
    # get number of consecutive 1s
    c1 = (
        df.groupby(df.a.ne(df.a.shift()).cumsum())["a"]
        .transform("size")
        .where(df.a.eq(1), 0)
    )
    # find stretches between N and M
    m1 = c1.eq(N)
    m2 = c1.eq(M)
    c2 = pd.Series(np.select([m1.shift() & ~m1, m2], [True, False], np.nan),
                   index=df.index).ffill().eq(1)
    # debug mode to understand how this works
    if debug:
        return df.assign(c1=c1, c2=c2,
                         length=c2[c2].groupby(c2.ne(c2.shift()).cumsum())
                                      .transform('size'))
    # get the length of the stretches
    return c2[c2].groupby(c2.ne(c2.shift()).cumsum()).size().to_list()

get_N_seq_stat(df, N=2, M=3)
Output: [6, 9]
Intermediate c1, c2, and length:
get_N_seq_stat(df, N=2, M=3, debug=True)
    a  c1     c2  length
0   1   2  False     NaN
1   1   2  False     NaN
2   0   0   True     6.0
3   0   0   True     6.0
4   0   0   True     6.0
5   1   2   True     6.0
6   1   2   True     6.0
7   0   0   True     6.0
8   1   3  False     NaN
9   1   3  False     NaN
10  1   3  False     NaN
11  0   0  False     NaN
12  0   0  False     NaN
13  1   2  False     NaN
14  1   2  False     NaN
15  0   0   True     9.0
16  0   0   True     9.0
17  0   0   True     9.0
18  0   0   True     9.0
19  1   2   True     9.0
20  1   2   True     9.0
21  0   0   True     9.0
22  0   0   True     9.0
23  0   0   True     9.0
24  1   3  False     NaN
25  1   3  False     NaN
26  1   3  False     NaN

Compare two pandas DataFrames in the most efficient way

Let's consider two pandas dataframes:
import numpy as np
import pandas as pd
df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])
check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])
If we want to do the following:
If df[1] > check_df[1] or df[2] > check_df[1] or df[3] > check_df[1], then we assign 1 to df, and -1 otherwise.
If df[2] > check_df[2] or df[3] > check_df[2] or df[4] > check_df[2], then we assign 1 to df, and -1 otherwise.
We apply the same algorithm to the end of the DataFrame.
My primitive code is the following:
df_copy = df.copy()
for i in range(len(df) - 3):
    moving_df = df.iloc[i:i+3]
    if (moving_df > check_df.iloc[i]).any()[0]:
        df_copy.iloc[i] = 1
    else:
        df_copy.iloc[i] = -1
df_copy
   0
0 -1
1  1
2 -1
3  1
4  1
5 -1
6  3
7  6
8  7
Could you please advise whether there is any possibility to do this without a loop?
IIUC, this is easily done with a rolling max:
N = 3
df['out'] = np.where(df[0].rolling(N, min_periods=1).max().shift(1-N).gt(check_df[0]),
                     1, -1)
output:
   0  out
0  1   -1
1  2    1
2  3   -1
3  2    1
4  5    1
5  4   -1
6  3    1
7  6   -1
8  7   -1
To keep the last items as is:
m = df[0].rolling(N).max().shift(1-N)
df['out'] = np.where(m.gt(check_df[0]), 1, -1)
df['out'] = df['out'].mask(m.isna(), df[0])
output:
   0  out
0  1   -1
1  2    1
2  3   -1
3  2    1
4  5    1
5  4   -1
6  3    1
7  6    6
8  7    7
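A quick illustration of why rolling(N).max().shift(1-N) acts as a forward-looking window (a minimal sketch):
import pandas as pd

s = pd.Series([1, 2, 3, 2, 5])
N = 3
# shift(1-N) realigns the trailing rolling max so that row i sees
# max(s[i], s[i+1], s[i+2]); trailing rows without a full window become NaN
print(s.rolling(N).max().shift(1 - N).tolist())  # [3.0, 3.0, 5.0, nan, nan]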
Although @mozway has already provided a very smart solution, I would like to share my approach as well, which was inspired by this post.
You could create your own object that compares a series with a rolling series. The comparison can be performed with the typical operators, i.e. >, < or ==. If at least one comparison in the window holds, the object returns a pre-defined value (given in the list returns_tf, where the first element is returned if the comparison is true and the second if it is false).
Possible Code:
import numpy as np
import pandas as pd
df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])
check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])
class RollingComparison:
    def __init__(self, comparing_series: pd.Series, rolling_series: pd.Series, window: int):
        self.comparing_series = comparing_series.values[:-1*window]
        self.rolling_series = rolling_series.values
        self.window = window

    def rolling_window_mask(self, option: str = "smaller"):
        shape = self.rolling_series.shape[:-1] + (self.rolling_series.shape[-1] - self.window + 1, self.window)
        strides = self.rolling_series.strides + (self.rolling_series.strides[-1],)
        rolling_window = np.lib.stride_tricks.as_strided(self.rolling_series, shape=shape, strides=strides)[:-1]
        rolling_window_mask = (
            self.comparing_series.reshape(-1, 1) < rolling_window if option == "smaller" else (
                self.comparing_series.reshape(-1, 1) > rolling_window if option == "greater"
                else self.comparing_series.reshape(-1, 1) == rolling_window
            )
        )
        return rolling_window_mask.any(axis=1)

    def assign(self, option: str = "rolling", returns_tf: list = [1, -1]):
        mask = self.rolling_window_mask(option)
        return np.concatenate((np.where(mask, returns_tf[0], returns_tf[1]), self.rolling_series[-1*self.window:]))
The assignments can be achieved as follows:
roller = RollingComparison(check_df[0], df[0], 3)
check_df["rolling_smaller_checking"] = roller.assign(option="smaller")
check_df["rolling_greater_checking"] = roller.assign(option="greater")
check_df["rolling_equals_checking"] = roller.assign(option="equal")
Output (the column rolling_smaller_checking equals your desired output):
   0  rolling_smaller_checking  rolling_greater_checking  rolling_equals_checking
0  3                        -1                         1                        1
1  2                         1                        -1                        1
2  5                        -1                         1                        1
3  4                         1                         1                        1
4  3                         1                        -1                        1
5  6                        -1                         1                        1
6  4                         3                         3                        3
7  2                         6                         6                        6
8  1                         7                         7                        7
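On recent NumPy (>= 1.20), sliding_window_view is a safer way to build the same window matrix that as_strided constructs above (a sketch, not a drop-in rewrite of the class):
import numpy as np

a = np.array([1, 2, 3, 2, 5, 4, 3, 6, 7])
windows = np.lib.stride_tricks.sliding_window_view(a, 3)
print(windows.shape)  # (7, 3): one row per window of length 3
print(windows[0])     # [1 2 3]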

Pandas using apply lambda with two different operators

This question is very similar to one I posted before, with just one change. Instead of taking just the absolute difference for all the columns, I also want a magnitude check for the 'z' column: if the current z is at least 1.1 times the previous one, keep the row.
(more context to the problem)
Pandas using the previous rank values to filter out current row
df = pd.DataFrame({
    'rank': [1, 1, 2, 2, 3, 3],
    'x': [0, 3, 0, 3, 4, 2],
    'y': [0, 4, 0, 4, 5, 5],
    'z': [1, 3, 1.2, 3.25, 3, 6],
})
print(df)
#    rank  x  y     z
# 0     1  0  0  1.00
# 1     1  3  4  3.00
# 2     2  0  0  1.20
# 3     2  3  4  3.25
# 4     3  4  5  3.00
# 5     3  2  5  6.00
Here's what I want the output to be:
output = pd.DataFrame({
    'rank': [1, 1, 2, 3],
    'x': [0, 3, 0, 2],
    'y': [0, 4, 0, 5],
    'z': [1, 3, 1.2, 6],
})
print(output)
#    rank  x  y    z
# 0     1  0  0  1.0
# 1     1  3  4  3.0
# 2     2  0  0  1.2
# 3     3  2  5  6.0
Basically, I want the following: if the previous rank has any row whose x and y are within ±1 of the current row's x and y, AND whose z satisfies current z < 1.1 × previous z, remove the current row.
So given the rank-1 rows, ANY row in rank 2 with x in [-1, 1], y in [-1, 1], z < 1.1, OR x in [2, 4], y in [3, 5], z < 3.3 should be removed.
Here's a solution using numpy broadcasting:
# Initially, no row is dropped
df['drop'] = False

for r in range(df['rank'].min(), df['rank'].max()):
    # Find the x_min, x_max, y_min, y_max, z_max of the current rank
    cond = df['rank'] == r
    x, y, z = df.loc[cond, ['x', 'y', 'z']].to_numpy().T
    x_min, x_max = x + [[-1], [1]]  # use numpy broadcasting to apply ±1 in one command
    y_min, y_max = y + [[-1], [1]]
    z_max = z * 1.1

    # Find the x, y, z of the next rank. Raise them one dimension
    # so that we can build a comparison matrix against x_min, x_max, ...
    cond = df['rank'] == r + 1
    if not cond.any():
        continue
    x, y, z = df.loc[cond, ['x', 'y', 'z']].to_numpy().T[:, :, None]

    # Condition to drop a row
    drop = (
        (x_min <= x) & (x <= x_max) &
        (y_min <= y) & (y <= y_max) &
        (z <= z_max)
    ).any(axis=1)
    df.loc[cond, 'drop'] = drop

# Result
df[~df['drop']]
Condensed
An even more condensed version (and likely faster). This is a really good way to puzzle your future teammates when they read the code:
r, x, y, z = df[['rank', 'x', 'y', 'z']].T.to_numpy()
rr, xx, yy, zz = [col[:, None] for col in [r, x, y, z]]
drop = (
    (rr == r + 1) &
    (x - 1 <= xx) & (xx <= x + 1) &
    (y - 1 <= yy) & (yy <= y + 1) &
    (zz <= z * 1.1)
).any(axis=1)
# Result
df[~drop]
What this does is compare every row in df against every other row (including itself) and return True (i.e. drop) if:
the current row's rank == the other row's rank + 1; and
the current row's x, y, z fall within the specified range of the other row's x, y, z.
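A minimal illustration of the pairwise broadcasting pattern this relies on (small hypothetical arrays):
import numpy as np

x = np.array([0, 3, 4])
xx = x[:, None]  # shape (3, 1)
# broadcasting compares every element against every other element,
# producing a (3, 3) matrix of pairwise "within ±1" checks
print(np.abs(xx - x) <= 1)
# [[ True False False]
#  [False  True  True]
#  [False  True  True]]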
You need to slightly modify my previous code:
def check_previous_group(rank, d, groups):
    if not rank - 1 in groups.groups:
        # check if a previous group exists, else flag all rows False (i.e. not to be dropped)
        return pd.Series(False, index=d.index)
    else:
        # get previous group (rank-1)
        d_prev = groups.get_group(rank - 1)
        # get the absolute difference per row with the whole dataset
        # of the previous group: abs(d_prev-s);
        # if all differences are within 1/1/0.1*z for x/y/z
        # for at least one row of the previous group,
        # then flag the row to be dropped (True)
        return d.apply(lambda s: abs(d_prev - s)[['x', 'y', 'z']].le([1, 1, .1*s['z']]).all(1).any(), axis=1)

groups = df.groupby('rank')
mask = pd.concat([check_previous_group(rank, d, groups) for rank, d in groups])
df[~mask]
output:
   rank  x  y    z
0     1  0  0  1.0
1     1  3  4  3.0
2     2  0  0  1.2
5     3  2  5  6.0
I have modified mozway's function so that it works according to your requirements.
# comparing 'equal' float values may go wrong, that's why I am using this constant
DELTA = 0.1**12

def check_previous_group(rank, d, groups):
    if not rank - 1 in groups.groups:
        # check if a previous group exists, else flag all rows False (i.e. not to be dropped)
        return pd.Series(False, index=d.index)
    else:
        # get previous group (rank-1)
        d_prev = groups.get_group(rank - 1)
        # get the absolute difference per row with the whole dataset
        # of the previous group: abs(d_prev-s);
        # if the differences in x and y are within 1 and z < 1.1*z_prev
        # for at least one row of the previous group,
        # then flag the row to be dropped (True)
        return d.apply(lambda s: (abs(d_prev - s)[['x', 'y']].le([1, 1]).all(1) &
                                  (s['z'] < 1.1*d_prev['z'] - DELTA)).any(), axis=1)
Tests:
>>> df = pd.DataFrame({
...     'rank': [1, 1, 2, 2, 3, 3],
...     'x': [0, 3, 0, 3, 4, 2],
...     'y': [0, 4, 0, 4, 5, 5],
...     'z': [1, 3, 1.2, 3.25, 3, 6],
... })
>>> df
   rank  x  y     z
0     1  0  0  1.00
1     1  3  4  3.00
2     2  0  0  1.20
3     2  3  4  3.25
4     3  4  5  3.00
5     3  2  5  6.00
>>> groups = df.groupby('rank')
>>> mask = pd.concat([check_previous_group(rank, d, groups) for rank, d in groups])
>>> df[~mask]
   rank  x  y    z
0     1  0  0  1.0
1     1  3  4  3.0
2     2  0  0  1.2
5     3  2  5  6.0
>>> df = pd.DataFrame({
...     'rank': [1, 1, 2, 2, 3, 3],
...     'x': [0, 3, 0, 3, 4, 2],
...     'y': [0, 4, 0, 4, 5, 5],
...     'z': [1, 3, 1.2, 3.3, 3, 6],
... })
>>> df
   rank  x  y    z
0     1  0  0  1.0
1     1  3  4  3.0
2     2  0  0  1.2
3     2  3  4  3.3
4     3  4  5  3.0
5     3  2  5  6.0
>>> groups = df.groupby('rank')
>>> mask = pd.concat([check_previous_group(rank, d, groups) for rank, d in groups])
>>> df[~mask]
   rank  x  y    z
0     1  0  0  1.0
1     1  3  4  3.0
2     2  0  0  1.2
3     2  3  4  3.3
5     3  2  5  6.0
It just takes an adjustment to the z term of the lambda expression from the linked post:
return d.apply(lambda s: abs(d_prev-s)[['x', 'y', 'z']].le([1,1,.1*d_prev['z']]).all(1).any(), axis=1)
Here's the full code that works for me:
df = pd.DataFrame({
    'rank': [1, 1, 2, 2, 2, 3, 3],
    'x': [0, 3, 0, 3, 3, 4, 2],
    'y': [0, 4, 0, 4, 4, 5, 5],
    'z': [1, 3, 1.2, 3.3, 3.31, 3, 6],
})

def check_previous_group(rank, d, groups):
    if not rank - 1 in groups.groups:
        # check if a previous group exists, else flag all rows False (i.e. not to be dropped)
        return pd.Series(False, index=d.index)
    else:
        # get previous group (rank-1)
        d_prev = groups.get_group(rank - 1)
        # get the absolute difference per row with the whole dataset
        # of the previous group: abs(d_prev-s);
        # if all differences are within 1/1/0.1*z_prev for x/y/z
        # for at least one row of the previous group,
        # then flag the row to be dropped (True)
        return d.apply(lambda s: abs(d_prev - s)[['x', 'y', 'z']].le([1, 1, .1*d_prev['z']]).all(1).any(), axis=1)

groups = df.groupby('rank')
mask = pd.concat([check_previous_group(rank, d, groups) for rank, d in groups])
df[~mask]
This works for me on Python 3.8.6
import pandas as pd

dfg = df.groupby("rank")

def filter_func(dfg):
    for g in dfg.groups.keys():
        if g - 1 in dfg.groups.keys():
            yield (
                pd.merge(
                    dfg.get_group(g).assign(id=lambda df: df.index),
                    dfg.get_group(g - 1),
                    how="cross", suffixes=("", "_prev")
                ).assign(
                    cond=lambda df: ~(
                        (df.x - df.x_prev).abs().le(1) &
                        (df.y - df.y_prev).abs().le(1) &
                        df.z.divide(df.z_prev).lt(1.1)
                    )
                )
            ).groupby("id").agg(
                {
                    **{"cond": "all"},
                    **{k: "first" for k in df.columns}
                }
            ).loc[lambda df: df.cond].drop(columns=["cond"])
        else:
            yield dfg.get_group(g)

pd.concat(filter_func(dfg), ignore_index=True)
The output seems to match what you expected:
   rank  x  y    z
0     1  0  0  1.0
1     1  3  4  3.0
2     2  0  0  1.2
3     3  2  5  6.0
Small edit: in your question it seems like you care about the row index. The solution I posted just ignores this, but if you want to keep it, just save it as an additional column in the dataframe.
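For reference, how="cross" (available since pandas 1.2) pairs every row of the left frame with every row of the right, which is what makes the pairwise condition above possible (a minimal sketch):
import pandas as pd

left = pd.DataFrame({'x': [0, 3]})
right = pd.DataFrame({'x_prev': [1, 4]})
print(pd.merge(left, right, how='cross'))
#    x  x_prev
# 0  0       1
# 1  0       4
# 2  3       1
# 3  3       4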

Vectorized cumulative sum based on value in array numpy [duplicate]

Let's say I have a Pandas DataFrame df:
Date      Value
01/01/17      0
01/02/17      0
01/03/17      1
01/04/17      0
01/05/17      0
01/06/17      0
01/07/17      1
01/08/17      0
01/09/17      0
For each row, I want to efficiently calculate the days since the last occurrence of Value=1.
So that df becomes:
Date      Value  Last_Occurence
01/01/17      0             NaN
01/02/17      0             NaN
01/03/17      1               0
01/04/17      0               1
01/05/17      0               2
01/06/17      0               3
01/07/17      1               0
01/08/17      0               1
01/09/17      0               2
I could do a loop:
for i in range(0, len(df)):
    last = np.where(df.loc[0:i, 'Value'] == 1)
    df.loc[i, 'Last_Occurence'] = i - last
But it seems very inefficient for extremely large data sets, and probably isn't right anyway.
Here's a NumPy approach -
def intervaled_cumsum(a, trigger_val=1, start_val=0, invalid_specifier=-1):
    out = np.ones(a.size, dtype=int)
    idx = np.flatnonzero(a == trigger_val)
    if len(idx) == 0:
        return np.full(a.size, invalid_specifier)
    else:
        out[idx[0]] = -idx[0] + 1
        out[0] = start_val
        out[idx[1:]] = idx[:-1] - idx[1:] + 1
        np.cumsum(out, out=out)
        out[:idx[0]] = invalid_specifier
        return out
A few sample runs on array data to showcase the usage, covering various scenarios of trigger and start values:
In [120]: a
Out[120]: array([0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0])
In [121]: p1 = intervaled_cumsum(a, trigger_val=1, start_val=0)
...: p2 = intervaled_cumsum(a, trigger_val=1, start_val=1)
...: p3 = intervaled_cumsum(a, trigger_val=0, start_val=0)
...: p4 = intervaled_cumsum(a, trigger_val=0, start_val=1)
...:
In [122]: np.vstack(( a, p1, p2, p3, p4 ))
Out[122]:
array([[ 0,  1,  1,  1,  0,  0,  1,  0,  0,  1,  1,  1,  1,  1,  0],
       [-1,  0,  0,  0,  1,  2,  0,  1,  2,  0,  0,  0,  0,  0,  1],
       [-1,  1,  1,  1,  2,  3,  1,  2,  3,  1,  1,  1,  1,  1,  2],
       [ 0,  1,  2,  3,  0,  0,  1,  0,  0,  1,  2,  3,  4,  5,  0],
       [ 1,  2,  3,  4,  1,  1,  2,  1,  1,  2,  3,  4,  5,  6,  1]])
Using it to solve our case:
df['Last_Occurence'] = intervaled_cumsum(df.Value.values)
Sample output -
In [181]: df
Out[181]:
       Date  Value  Last_Occurence
0  01/01/17      0              -1
1  01/02/17      0              -1
2  01/03/17      1               0
3  01/04/17      0               1
4  01/05/17      0               2
5  01/06/17      0               3
6  01/07/17      1               0
7  01/08/17      0               1
8  01/09/17      0               2
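To see how the function builds its result, here is a step-by-step trace on the question's Value column (a sketch of the same logic with the default arguments):
import numpy as np

a = np.array([0, 0, 1, 0, 0, 0, 1, 0, 0])
idx = np.flatnonzero(a == 1)           # positions of the triggers: [2, 6]
out = np.ones(a.size, dtype=int)       # every step counts +1 by default
out[idx[0]] = -idx[0] + 1              # make the cumsum land on 0 at the first trigger
out[0] = 0                             # start_val
out[idx[1:]] = idx[:-1] - idx[1:] + 1  # reset the counter at every later trigger
np.cumsum(out, out=out)
out[:idx[0]] = -1                      # positions before the first trigger are invalid
print(out)                             # [-1 -1  0  1  2  3  0  1  2]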
Runtime test
Approaches -
# @Scott Boston's soln
def pandas_groupby(df):
    mask = df.Value.cumsum().replace(0, False).astype(bool)
    return df.assign(Last_Occurance=df.groupby(df.Value.astype(bool).cumsum())
                                      .cumcount().where(mask))

# Proposed in this post
def numpy_based(df):
    df['Last_Occurence'] = intervaled_cumsum(df.Value.values)
Timings -
In [33]: df = pd.DataFrame((np.random.rand(10000000)>0.7).astype(int), columns=['Value'])
In [34]: %timeit pandas_groupby(df)
1 loops, best of 3: 1.06 s per loop
In [35]: %timeit numpy_based(df)
10 loops, best of 3: 103 ms per loop
In [36]: df = pd.DataFrame((np.random.rand(100000000)>0.7).astype(int), columns=['Value'])
In [37]: %timeit pandas_groupby(df)
1 loops, best of 3: 11.1 s per loop
In [38]: %timeit numpy_based(df)
1 loops, best of 3: 1.03 s per loop
Let's try this using cumsum, cumcount, and groupby:
mask = df.Value.cumsum().replace(0,False).astype(bool) #Mask starting zeros as NaN
df_out = df.assign(Last_Occurance=df.groupby(df.Value.astype(bool).cumsum()).cumcount().where(mask))
print(df_out)
output:
       Date  Value  Last_Occurance
0  01/01/17      0             NaN
1  01/02/17      0             NaN
2  01/03/17      1             0.0
3  01/04/17      0             1.0
4  01/05/17      0             2.0
5  01/06/17      0             3.0
6  01/07/17      1             0.0
7  01/08/17      0             1.0
8  01/09/17      0             2.0
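The key step here is df.Value.astype(bool).cumsum(), which labels every "since the last 1" block with its own group id so cumcount can count within each block (a minimal sketch):
import pandas as pd

v = pd.Series([0, 0, 1, 0, 0, 0, 1, 0, 0])
print(v.astype(bool).cumsum().tolist())  # [0, 0, 1, 1, 1, 1, 2, 2, 2]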
You can use argmax:
df.apply(lambda x: np.argmax(df.iloc[x.name::-1].Value.tolist()), axis=1)
Out[85]:
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 1
8 2
dtype: int64
If you have to have NaN for the first 2 rows, use:
df.apply(lambda x: np.argmax(df.iloc[x.name::-1].Value.tolist())
                   if 1 in df.iloc[x.name::-1].Value.values
                   else np.nan, axis=1)
Out[86]:
0 NaN
1 NaN
2 0.0
3 1.0
4 2.0
5 3.0
6 0.0
7 1.0
8 2.0
dtype: float64
You don't have to update the value of last at every step of the loop. Initialize a variable outside the loop:
last = np.nan
for i in range(len(df)):
    if df.loc[i, 'Value'] == 1:
        last = i
    df.loc[i, 'Last_Occurence'] = i - last
and update it only when a 1 occurs in column Value.
Note that no matter which method you select, iterating over the whole table once is inevitable.

Find the average of the element above and below in that column if that element is 0 - Pandas DataFrame

I'd like to create a new dataframe using the same values from another dataframe, unless there is a 0 value. If there is a 0 value, I'd like to find the average of the entry before and after.
For Example:
df =
   A  B  C
   5  2  1
   3  4  5
   2  1  0
   6  8  7
I'd like the result to look like the df below:
df_new =
   A  B  C
   5  2  1
   3  4  5
   2  1  6
   6  8  7
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [5, 3, 2, 6], 'B': [2, 4, 1, 8], 'C': [1, 5, 0, 7]})
Nrows = len(df)

def run(col):
    originalValues = list(df[col])
    # positions of the zero entries in this column
    values = list(np.where(np.array(originalValues) == 0)[0])
    # only interior zeros have both a neighbour above and below
    indices2replace = filter(lambda x: 0 < x < Nrows - 1, values)
    for index in indices2replace:
        originalValues[index] = 0.5 * (originalValues[index + 1] + originalValues[index - 1])
    return originalValues

newDF = pd.DataFrame([run(col) for col in df.columns], index=df.columns).transpose()
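A vectorized sketch of the same idea, assuming (as in the example) that zeros only occur in interior rows: mask swaps each zero for the mean of the row above and below.
import pandas as pd

df = pd.DataFrame({'A': [5, 3, 2, 6], 'B': [2, 4, 1, 8], 'C': [1, 5, 0, 7]})
# shift() is the row above, shift(-1) the row below
df_new = df.mask(df.eq(0), (df.shift() + df.shift(-1)) / 2)
print(df_new)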
