I would like to add a column to a data frame indicating whether another column is increasing, decreasing, or staying the same, encoded as:
1 -> increasing, 0 -> same, -1 -> decreasing
So if df['battery'] = [1,2,3,4,7,9,3,3,3]
I would like the state column to be df['state'] = [1,1,1,1,1,-1,0,0]
This should do the trick!
a = [1,2,3,4,7,9,3,3,3]
b = []
for x in range(len(a)-1):
    b.append((a[x+1] > a[x]) - (a[x+1] < a[x]))
print(b)
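If you then want to attach the result to a DataFrame, here is a minimal sketch (assuming the first row should be NaN, since it has no predecessor):
import numpy as np
import pandas as pd
df = pd.DataFrame({'battery': a})
# pad the front so the 8 pairwise states line up with the 9 rows
df['state'] = [np.nan] + b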
You could use the pd.Series.diff method to get the difference between consecutive values, and then assign the state values using boolean indexing:
import pandas as pd
df = pd.DataFrame()
df['battery'] = [1,2,3,4,7,9,3,3,3]
diff = df['battery'].diff()
df.loc[diff > 0, 'state'] = 1
df.loc[diff == 0, 'state'] = 0
df.loc[diff < 0, 'state'] = -1
print(df)
# battery state
# 0 1 NaN
# 1 2 1.0
# 2 3 1.0
# 3 4 1.0
# 4 7 1.0
# 5 9 1.0
# 6 3 -1.0
# 7 3 0.0
# 8 3 0.0
Or, alternatively, one could use np.select:
import numpy as np
diff = df['battery'].diff()
df['state'] = np.select([diff < 0, diff > 0], [-1, 1], 0)
# Be careful, default 0 will replace the first NaN as well.
print(df)
# battery state
# 0 1 0
# 1 2 1
# 2 3 1
# 3 4 1
# 4 7 1
# 5 9 1
# 6 3 -1
# 7 3 0
# 8 3 0
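If you would rather keep the first row as NaN (matching the diff-based version above), np.select also accepts a NaN default; note that this forces a float dtype:
df['state'] = np.select([diff < 0, diff > 0], [-1, 1], np.nan)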
So here's your dataframe:
>>> import pandas as pd
>>> data = [[[1,2,3,4,7,9,3,3,3]]]
>>> df = pd.DataFrame(data, columns = ['battery'])
>>> df
battery
0 [1, 2, 3, 4, 7, 9, 3, 3, 3]
And finally use apply and a lambda function in order to generate the required result:
>>> df['state'] = df.apply(lambda row: [1 if t - s > 0 else -1 if t-s < 0 else 0 for s, t in zip(row['battery'], row['battery'][1:])], axis=1)
>>> df
battery state
0 [1, 2, 3, 4, 7, 9, 3, 3, 3] [1, 1, 1, 1, 1, -1, 0, 0]
Alternatively, if you want the exact difference between each element in the list, you can use the following:
>>> df['state'] = df.apply(lambda row: [t - s for s, t in zip(row['battery'], row['battery'][1:])], axis=1)
>>> df
battery state
0 [1, 2, 3, 4, 7, 9, 3, 3, 3] [1, 1, 1, 3, 2, -6, 0, 0]
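If a NumPy dependency is acceptable, the comprehension can also be compressed with np.diff and np.sign (a sketch equivalent to the first variant above):
>>> import numpy as np
>>> df['state'] = df.apply(lambda row: np.sign(np.diff(row['battery'])).tolist(), axis=1)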
Try np.sign (the pd.np shortcut was deprecated and has been removed in pandas 2.0):
import numpy as np
np.sign(df.battery.diff().fillna(1))
0 1.0
1 1.0
2 1.0
3 1.0
4 1.0
5 1.0
6 -1.0
7 0.0
8 0.0
Name: battery, dtype: float64
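If integer labels are preferred, a cast after the fillna works; note that fillna(1) treats the first row as increasing, which is an arbitrary choice:
df['state'] = np.sign(df.battery.diff().fillna(1)).astype(int)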
TLDR version:
I have a column like below,
[2, 2, 0, 0, 0, 2, 2, 0, 3, 3, 3, 0, 0, 2, 2, 0, 0, 0, 0, 2, 2, 0, 0, 0, 3, 3, 3]
# longer runs may also occur, like 4, 5, 6, 7, 8...
I need a function with parameters n and m. If I use
n=2, m=3,
I get the distance from each run of 2 to the next run of 3, and the final result after grouping would be:
[6, 9]
Detailed version
Here is the test case. I'm writing a function that, given n and m, generates a list of distances between consecutive runs. Currently, the function works with only one parameter N (the distance from one run of N consecutive values to the next run of N). I want to modify it so that it also accepts M.
import pandas as pd
import numpy as np
dummy = [1,1,0,0,0,1,1,0,1,1,1,0,0,1,1,0,0,0,0,1,1,0,0,0,1,1,1]
df = pd.DataFrame({'a': dummy})
What I have written so far:
def get_N_seq_stat(df, N=2, M=3):
    df["c1"] = (
        df.groupby(df.a.ne(df.a.shift()).cumsum())["a"]
        .transform("size")
        .where(df.a.eq(1), 0)
    )
    df["c2"] = np.where(df.c1.ne(N), 1, 0)
    df["c3"] = df["c2"].ne(df["c2"].shift()).cumsum()
    result = df.loc[df["c2"] == 1].groupby("c3")["c2"].count().tolist()
    # if the last N rows are not a run, the last distance should not be counted
    if not (df["c1"].tail(N) == N).all():
        del result[-1]
    if not (df["c1"].head(N) == N).all():
        del result[0]
    return result
If I set N=2, M=3 (from 2 consecutive to 3 consecutive), the ideal return value would be [6, 9], because of the spans highlighted below.
dummy = [1,1,**0,0,0,1,1,0,**1,1,1,0,0,1,1,**0,0,0,0,1,1,0,0,0,**1,1,1]
Currently, if I set N=2, the returned list is [3, 6, 4], because
dummy = [1,1,**0,0,0,**1,1,**0,1,1,1,0,0,**1,1,**0,0,0,0,**1,1,0,0,0,1,1,1]
I would modify your code this way:
def get_N_seq_stat(df, N=2, M=3, debug=False):
    # get number of consecutive 1s
    c1 = (
        df.groupby(df.a.ne(df.a.shift()).cumsum())["a"]
        .transform("size")
        .where(df.a.eq(1), 0)
    )
    # find stretches between N and M
    m1 = c1.eq(N)
    m2 = c1.eq(M)
    c2 = pd.Series(np.select([m1.shift() & ~m1, m2], [True, False], np.nan),
                   index=df.index).ffill().eq(1)
    # debug mode to understand how this works
    if debug:
        return df.assign(c1=c1, c2=c2,
                         length=c2[c2].groupby(c2.ne(c2.shift()).cumsum())
                                      .transform('size')
                         )
    # get the length of the stretches
    return c2[c2].groupby(c2.ne(c2.shift()).cumsum()).size().to_list()
get_N_seq_stat(df, N=2, M=3)
Output: [6, 9]
Intermediate c1, c2, and length (the np.select/ffill pair acts as a set/reset switch: c2 flips to True right after an N-run ends and back to False once an M-run starts):
get_N_seq_stat(df, N=2, M=3, debug=True)
a c1 c2 length
0 1 2 False NaN
1 1 2 False NaN
2 0 0 True 6.0
3 0 0 True 6.0
4 0 0 True 6.0
5 1 2 True 6.0
6 1 2 True 6.0
7 0 0 True 6.0
8 1 3 False NaN
9 1 3 False NaN
10 1 3 False NaN
11 0 0 False NaN
12 0 0 False NaN
13 1 2 False NaN
14 1 2 False NaN
15 0 0 True 9.0
16 0 0 True 9.0
17 0 0 True 9.0
18 0 0 True 9.0
19 1 2 True 9.0
20 1 2 True 9.0
21 0 0 True 9.0
22 0 0 True 9.0
23 0 0 True 9.0
24 1 3 False NaN
25 1 3 False NaN
26 1 3 False NaN
Let's consider two pandas dataframes:
import numpy as np
import pandas as pd
df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])
check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])
I want to do the following:
If df[1] > check_df[1] or df[2] > check_df[1] or df[3] > check_df[1], then we assign 1, and -1 otherwise.
If df[2] > check_df[2] or df[3] > check_df[2] or df[4] > check_df[2], then we assign 1, and -1 otherwise.
We apply the same algorithm up to the end of the DataFrame.
My primitive code is the following:
df_copy = df.copy()
for i in range(len(df) - 3):
    moving_df = df.iloc[i:i+3]
    if (moving_df > check_df.iloc[i]).any()[0]:
        df_copy.iloc[i] = 1
    else:
        df_copy.iloc[i] = -1
df_copy
0
0 -1
1 1
2 -1
3 1
4 1
5 -1
6 3
7 6
8 7
Could you please advise whether there is any way to do this without a loop?
IIUC, this is easily done with a rolling max:
N = 3  # window size
df['out'] = np.where(df[0].rolling(N, min_periods=1).max().shift(1-N).gt(check_df[0]),
                     1, -1)
output:
0 out
0 1 -1
1 2 1
2 3 -1
3 2 1
4 5 1
5 4 -1
6 3 1
7 6 -1
8 7 -1
to keep the last items as is:
m = df[0].rolling(N).max().shift(1-N)
df['out'] = np.where(m.gt(check_df[0]),
1, -1)
df['out'] = df['out'].mask(m.isna(), df[0])
output:
0 out
0 1 -1
1 2 1
2 3 -1
3 2 1
4 5 1
5 4 -1
6 3 1
7 6 6
8 7 7
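A note on the shift(1-N) trick, since it is easy to misread: the trailing rolling max at row i covers rows i-N+1 through i, and shifting by -(N-1) re-anchors it so each row sees the max of itself and the N-1 rows after it. An equivalent sketch (not from the original answer) reverses the series instead, which also fills the tail with partial-window maxima rather than NaN:
# forward-looking rolling max via double reversal
fwd_max = df[0][::-1].rolling(N, min_periods=1).max()[::-1]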
Although @mozway has already provided a very smart solution, I would like to share my approach as well, which was inspired by this post.
You could create your own object that compares a series with a rolling series. The comparison could be performed by typical operators, i.e. >, < or ==. If at least one comparison holds, the object would return a pre-defined value (given in list returns_tf, where the first element would be returned if the comparison is true, and the second if it's false).
Possible Code:
import numpy as np
import pandas as pd
df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])
check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])
class RollingComparison:
    def __init__(self, comparing_series: pd.Series, rolling_series: pd.Series, window: int):
        self.comparing_series = comparing_series.values[:-1*window]
        self.rolling_series = rolling_series.values
        self.window = window
    def rolling_window_mask(self, option: str = "smaller"):
        shape = self.rolling_series.shape[:-1] + (self.rolling_series.shape[-1] - self.window + 1, self.window)
        strides = self.rolling_series.strides + (self.rolling_series.strides[-1],)
        rolling_window = np.lib.stride_tricks.as_strided(self.rolling_series, shape=shape, strides=strides)[:-1]
        rolling_window_mask = (
            self.comparing_series.reshape(-1, 1) < rolling_window if option == "smaller" else (
                self.comparing_series.reshape(-1, 1) > rolling_window if option == "greater" else self.comparing_series.reshape(-1, 1) == rolling_window
            )
        )
        return rolling_window_mask.any(axis=1)
    def assign(self, option: str = "rolling", returns_tf: list = [1, -1]):
        mask = self.rolling_window_mask(option)
        return np.concatenate((np.where(mask, returns_tf[0], returns_tf[1]), self.rolling_series[-1*self.window:]))
The assignments can be achieved as follows:
roller = RollingComparison(check_df[0], df[0], 3)
check_df["rolling_smaller_checking"] = roller.assign(option="smaller")
check_df["rolling_greater_checking"] = roller.assign(option="greater")
check_df["rolling_equals_checking"] = roller.assign(option="equal")
Output (the column rolling_smaller_checking equals your desired output):
0 rolling_smaller_checking rolling_greater_checking rolling_equals_checking
0 3 -1 1 1
1 2 1 -1 1
2 5 -1 1 1
3 4 1 1 1
4 3 1 -1 1
5 6 -1 1 1
6 4 3 3 3
7 2 6 6 6
8 1 7 7 7
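A side note on the implementation: np.lib.stride_tricks.as_strided is easy to get wrong, since a bad shape or stride silently reads arbitrary memory. On NumPy 1.20+ the same window matrix can be built with the safer sliding_window_view; a sketch mirroring the [:-1] slice above:
import numpy as np
windows = np.lib.stride_tricks.sliding_window_view(df[0].to_numpy(), 3)[:-1]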
This question is very similar to one I posted before, with just one change: instead of taking only the absolute difference for all the columns, I also want to apply a magnitude check to the 'z' column, so if the current z is more than 1.1x the previous z, keep it.
(more context to the problem)
Pandas using the previous rank values to filter out current row
df = pd.DataFrame({
    'rank': [1, 1, 2, 2, 3, 3],
    'x': [0, 3, 0, 3, 4, 2],
    'y': [0, 4, 0, 4, 5, 5],
    'z': [1, 3, 1.2, 3.25, 3, 6],
})
print(df)
# rank x y z
# 0 1 0 0 1.00
# 1 1 3 4 3.00
# 2 2 0 0 1.20
# 3 2 3 4 3.25
# 4 3 4 5 3.00
# 5 3 2 5 6.00
Here's what I want the output to be
output = pd.DataFrame({
    'rank': [1, 1, 2, 3],
    'x': [0, 3, 0, 2],
    'y': [0, 4, 0, 5],
    'z': [1, 3, 1.2, 6],
})
print(output)
# rank x y z
# 0 1 0 0 1.0
# 1 1 3 4 3.0
# 2 2 0 0 1.2
# 3 3 2 5 6.0
Basically, what I want to happen is: if the previous rank has any row whose x and y are within ±1 (both ways) AND whose z satisfies z < 1.1*z_prev, remove the current row.
So, given the rank-1 rows, any rank-2 row with any combination of x in (-1..1), y in (-1..1), z < 1.1 OR x in (2..4), y in (3..5), z < 3.3 should be removed.
Here's a solution using numpy broadcasting:
# Initially, no row is dropped
df['drop'] = False
for r in range(df['rank'].min(), df['rank'].max()):
    # Find the x_min, x_max, y_min, y_max, z_max of the current rank
    cond = df['rank'] == r
    x, y, z = df.loc[cond, ['x','y','z']].to_numpy().T
    x_min, x_max = x + [[-1], [1]]  # use numpy broadcasting to ±1 in one command
    y_min, y_max = y + [[-1], [1]]
    z_max = z * 1.1
    # Find the x, y, z of the next rank. Raise them one dimension
    # so that we can make a comparison matrix against x_min, x_max, ...
    cond = df['rank'] == r + 1
    if not cond.any():
        continue
    x, y, z = df.loc[cond, ['x','y','z']].to_numpy().T[:, :, None]
    # Condition to drop a row
    drop = (
        (x_min <= x) & (x <= x_max) &
        (y_min <= y) & (y <= y_max) &
        (z <= z_max)
    ).any(axis=1)
    df.loc[cond, 'drop'] = drop
# Result
df[~df['drop']]
Condensed
An even more condensed version (and likely faster). This is a really good way to puzzle your future teammates when they read the code:
r, x, y, z = df[['rank', 'x', 'y', 'z']].T.to_numpy()
rr, xx, yy, zz = [col[:,None] for col in [r, x, y, z]]
drop = (
    (rr == r + 1) &
    (x-1 <= xx) & (xx <= x+1) &
    (y-1 <= yy) & (yy <= y+1) &
    (zz <= z*1.1)
).any(axis=1)
# Result
df[~drop]
What this does is compare every row in df against every other row (including itself) and return True (i.e. drop) if:
The current row's rank == the other row's rank + 1; and
The current row's x, y, z fall within the specified range of the other row's x, y, z.
One trade-off to note: this materializes an n-by-n comparison matrix, so memory grows quadratically with the number of rows.
You need to slightly modify my previous code:
def check_previous_group(rank, d, groups):
    if not rank-1 in groups.groups:
        # check if a previous group exists, else flag all rows False (i.e. not to be dropped)
        return pd.Series(False, index=d.index)
    else:
        # get previous group (rank-1)
        d_prev = groups.get_group(rank-1)
        # get the absolute difference per row with the whole dataset
        # of the previous group: abs(d_prev-s);
        # if all differences are within 1/1/0.1*z for x/y/z
        # for at least one row of the previous group,
        # then flag the row to be dropped (True)
        return d.apply(lambda s: abs(d_prev-s)[['x', 'y', 'z']].le([1, 1, .1*s['z']]).all(1).any(), axis=1)
groups = df.groupby('rank')
mask = pd.concat([check_previous_group(rank, d, groups) for rank,d in groups])
df[~mask]
output:
rank x y z
0 1 0 0 1.0
1 1 3 4 3.0
2 2 0 0 1.2
5 3 2 5 6.0
I have modified mozway's function so that it works according to your requirements.
# comparing floats for 'equality' can go wrong, which is why I use this constant
DELTA = 0.1**12
def check_previous_group(rank, d, groups):
    if not rank-1 in groups.groups:
        # check if a previous group exists, else flag all rows False (i.e. not to be dropped)
        return pd.Series(False, index=d.index)
    else:
        # get previous group (rank-1)
        d_prev = groups.get_group(rank-1)
        # get the absolute difference per row with the whole dataset
        # of the previous group: abs(d_prev-s);
        # if the differences in x and y are within 1 and z < 1.1*z_prev
        # for at least one row of the previous group,
        # then flag the row to be dropped (True)
        return d.apply(lambda s: (abs(d_prev-s)[['x', 'y']].le([1, 1]).all(1) &
                                  (s['z'] < 1.1*d_prev['z'] - DELTA)).any(), axis=1)
Tests:
>>> df = pd.DataFrame({
...     'rank': [1, 1, 2, 2, 3, 3],
...     'x': [0, 3, 0, 3, 4, 2],
...     'y': [0, 4, 0, 4, 5, 5],
...     'z': [1, 3, 1.2, 3.25, 3, 6],
... })
>>> df
rank x y z
0 1 0 0 1.00
1 1 3 4 3.00
2 2 0 0 1.20
3 2 3 4 3.25
4 3 4 5 3.00
5 3 2 5 6.00
>>> groups = df.groupby('rank')
>>> mask = pd.concat([check_previous_group(rank, d, groups) for rank,d in groups])
>>> df[~mask]
rank x y z
0 1 0 0 1.0
1 1 3 4 3.0
2 2 0 0 1.2
5 3 2 5 6.0
>>> df = pd.DataFrame({
...     'rank': [1, 1, 2, 2, 3, 3],
...     'x': [0, 3, 0, 3, 4, 2],
...     'y': [0, 4, 0, 4, 5, 5],
...     'z': [1, 3, 1.2, 3.3, 3, 6],
... })
>>> df
rank x y z
0 1 0 0 1.0
1 1 3 4 3.0
2 2 0 0 1.2
3 2 3 4 3.3
4 3 4 5 3.0
5 3 2 5 6.0
>>> groups = df.groupby('rank')
>>> mask = pd.concat([check_previous_group(rank, d, groups) for rank,d in groups])
>>> df[~mask]
rank x y z
0 1 0 0 1.0
1 1 3 4 3.0
2 2 0 0 1.2
3 2 3 4 3.3
5 3 2 5 6.0
It just takes an adjustment to the z term of the lambda expression from the linked post:
return d.apply(lambda s: abs(d_prev-s)[['x', 'y', 'z']].le([1,1,.1*d_prev['z']]).all(1).any(), axis=1)
Here's the full code that works for me:
df = pd.DataFrame({
    'rank': [1, 1, 2, 2, 2, 3, 3],
    'x': [0, 3, 0, 3, 3, 4, 2],
    'y': [0, 4, 0, 4, 4, 5, 5],
    'z': [1, 3, 1.2, 3.3, 3.31, 3, 6],
})
def check_previous_group(rank, d, groups):
    if not rank-1 in groups.groups:
        # check if a previous group exists, else flag all rows False (i.e. not to be dropped)
        return pd.Series(False, index=d.index)
    else:
        # get previous group (rank-1)
        d_prev = groups.get_group(rank-1)
        # get the absolute difference per row with the whole dataset
        # of the previous group: abs(d_prev-s);
        # if all differences are within 1/1/0.1*z_prev for x/y/z
        # for at least one row of the previous group,
        # then flag the row to be dropped (True)
        return d.apply(lambda s: abs(d_prev-s)[['x', 'y', 'z']].le([1, 1, .1*d_prev['z']]).all(1).any(), axis=1)
groups = df.groupby('rank')
mask = pd.concat([check_previous_group(rank, d, groups) for rank,d in groups])
df[~mask]
This works for me on Python 3.8.6
import pandas as pd
dfg = df.groupby("rank")
def filter_func(dfg):
    for g in dfg.groups.keys():
        if g-1 in dfg.groups.keys():
            yield (
                pd.merge(
                    dfg.get_group(g).assign(id=lambda df: df.index),
                    dfg.get_group(g-1),
                    how="cross", suffixes=("", "_prev")
                ).assign(
                    cond=lambda df: ~(
                        (df.x - df.x_prev).abs().le(1) & (df.y - df.y_prev).abs().le(1) & df.z.divide(df.z_prev).lt(1.1)
                    )
                )
            ).groupby("id").agg(
                {
                    **{"cond": "all"},
                    **{k: "first" for k in df.columns}
                }).loc[lambda df: df.cond].drop(columns=["cond"])
        else:
            yield dfg.get_group(g)
pd.concat(
    filter_func(dfg), ignore_index=True
)
The output seems to match what you expected:
rank x y z
0 1 0 0 1.0
1 1 3 4 3.0
2 2 0 0 1.2
3 3 2 5 6.0
Small edit: in your question it seems like you care about the row index. The solution I posted just ignores this, but if you want to keep it, just save it as an additional column in the dataframe.
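A hedged sketch of that variant: the merged groups are grouped by id, which was set to the original index, so dropping ignore_index should preserve the original row labels (the untouched first group already carries them):
# each chunk keeps its index: the first group keeps df's index,
# and the merged groups are indexed by "id" (the original index)
result = pd.concat(filter_func(dfg)).sort_index()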
Let's say I have a Pandas DataFrame df:
Date Value
01/01/17 0
01/02/17 0
01/03/17 1
01/04/17 0
01/05/17 0
01/06/17 0
01/07/17 1
01/08/17 0
01/09/17 0
For each row, I want to efficiently calculate the days since the last occurrence of Value=1.
So that df becomes:
Date Value Last_Occurence
01/01/17 0 NaN
01/02/17 0 NaN
01/03/17 1 0
01/04/17 0 1
01/05/17 0 2
01/06/17 0 3
01/07/17 1 0
01/08/17 0 1
01/09/17 0 2
I could do a loop:
for i in range(0, len(df)):
    last = np.where(df.loc[0:i,'Value']==1)
    df.loc[i, 'Last_Occurence'] = i-last
But it seems very inefficient for extremely large data sets and probably isn't right anyway.
Here's a NumPy approach -
def intervaled_cumsum(a, trigger_val=1, start_val=0, invalid_specifier=-1):
    out = np.ones(a.size, dtype=int)
    idx = np.flatnonzero(a == trigger_val)
    if len(idx) == 0:
        return np.full(a.size, invalid_specifier)
    else:
        out[idx[0]] = -idx[0] + 1
        out[0] = start_val
        out[idx[1:]] = idx[:-1] - idx[1:] + 1
        np.cumsum(out, out=out)
        out[:idx[0]] = invalid_specifier
        return out
A few sample runs on array data to showcase the usage, covering various scenarios of trigger and start values:
In [120]: a
Out[120]: array([0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0])
In [121]: p1 = intervaled_cumsum(a, trigger_val=1, start_val=0)
...: p2 = intervaled_cumsum(a, trigger_val=1, start_val=1)
...: p3 = intervaled_cumsum(a, trigger_val=0, start_val=0)
...: p4 = intervaled_cumsum(a, trigger_val=0, start_val=1)
...:
In [122]: np.vstack(( a, p1, p2, p3, p4 ))
Out[122]:
array([[ 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0],
[-1, 0, 0, 0, 1, 2, 0, 1, 2, 0, 0, 0, 0, 0, 1],
[-1, 1, 1, 1, 2, 3, 1, 2, 3, 1, 1, 1, 1, 1, 2],
[ 0, 1, 2, 3, 0, 0, 1, 0, 0, 1, 2, 3, 4, 5, 0],
[ 1, 2, 3, 4, 1, 1, 2, 1, 1, 2, 3, 4, 5, 6, 1]])
Using it to solve our case :
df['Last_Occurence'] = intervaled_cumsum(df.Value.values)
Sample output -
In [181]: df
Out[181]:
Date Value Last_Occurence
0 01/01/17 0 -1
1 01/02/17 0 -1
2 01/03/17 1 0
3 01/04/17 0 1
4 01/05/17 0 2
5 01/06/17 0 3
6 01/07/17 1 0
7 01/08/17 0 1
8 01/09/17 0 2
Runtime test
Approaches -
# @Scott Boston's soln
def pandas_groupby(df):
    mask = df.Value.cumsum().replace(0, False).astype(bool)
    return df.assign(Last_Occurance=df.groupby(df.Value.astype(bool)
                     .cumsum()).cumcount().where(mask))
# Proposed in this post
def numpy_based(df):
    df['Last_Occurence'] = intervaled_cumsum(df.Value.values)
Timings -
In [33]: df = pd.DataFrame((np.random.rand(10000000)>0.7).astype(int), columns=['Value'])
In [34]: %timeit pandas_groupby(df)
1 loops, best of 3: 1.06 s per loop
In [35]: %timeit numpy_based(df)
10 loops, best of 3: 103 ms per loop
In [36]: df = pd.DataFrame((np.random.rand(100000000)>0.7).astype(int), columns=['Value'])
In [37]: %timeit pandas_groupby(df)
1 loops, best of 3: 11.1 s per loop
In [38]: %timeit numpy_based(df)
1 loops, best of 3: 1.03 s per loop
Let's try this using cumsum, cumcount, and groupby:
mask = df.Value.cumsum().replace(0,False).astype(bool) #Mask starting zeros as NaN
df_out = df.assign(Last_Occurance=df.groupby(df.Value.astype(bool).cumsum()).cumcount().where(mask))
print(df_out)
output:
Date Value Last_Occurance
0 01/01/17 0 NaN
1 01/02/17 0 NaN
2 01/03/17 1 0.0
3 01/04/17 0 1.0
4 01/05/17 0 2.0
5 01/06/17 0 3.0
6 01/07/17 1 0.0
7 01/08/17 0 1.0
8 01/09/17 0 2.0
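If the float dtype (forced by the leading NaNs) is unwanted, pandas' nullable integer dtype keeps the NaNs while restoring integers; a small follow-up, assuming pandas >= 0.24:
df_out['Last_Occurance'] = df_out['Last_Occurance'].astype('Int64')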
You can use argmax:
df.apply(lambda x: np.argmax(df.iloc[x.name::-1].Value.tolist()),axis=1)
Out[85]:
0 0
1 0
2 0
3 1
4 2
5 3
6 0
7 1
8 2
dtype: int64
If you need NaN for the first 2 rows, use:
df.apply(lambda x: np.argmax(df.iloc[x.name::-1].Value.tolist()) \
if 1 in df.iloc[x.name::-1].Value.values \
else np.nan,axis=1)
Out[86]:
0 NaN
1 NaN
2 0.0
3 1.0
4 2.0
5 3.0
6 0.0
7 1.0
8 2.0
dtype: float64
You don't have to update last at every step of the for loop. Initialize a variable outside the loop
last = np.nan
for i in range(len(df)):
    if df.loc[i, 'Value'] == 1:
        last = i
    df.loc[i, 'Last_Occurence'] = i - last
and update it only when a 1 occurs in column Value.
Note that no matter what method you select, iterating the whole table once is inevitable.
I'd like to create a new dataframe using the same values from another dataframe, unless there is a 0 value. If there is a 0 value, I'd like to find the average of the entry before and after.
For Example:
df = A B C
5 2 1
3 4 5
2 1 0
6 8 7
I'd like the result to look like the df below:
df_new = A B C
5 2 1
3 4 5
2 1 6
6 8 7
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[5, 3, 2, 6], 'B':[2, 4, 1, 8], 'C':[1, 5, 0, 7]})
Nrows = len(df)
def run(col):
    originalValues = list(df[col])
    # indices of the zero entries in this column
    values = list(np.where(np.array(originalValues) == 0)[0])
    # only interior indices can be averaged with both neighbours
    indices2replace = filter(lambda x: 0 < x < Nrows - 1, values)
    for index in indices2replace:
        originalValues[index] = 0.5 * (originalValues[index+1] + originalValues[index-1])
    return originalValues
newDF = pd.DataFrame(list(map(run, df.columns)), index=df.columns).transpose()
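For the simple case shown here, a shorter route worth knowing is to mark the zeros as missing and interpolate; linear interpolation between the two neighbours is exactly their average. A sketch that assumes zeros never occur in the first or last row:
# replace 0 with NaN, then interpolate linearly between neighbours
df_new = df.replace(0, np.nan).interpolate()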