Pandas replace all but first in consecutive group - python

The problem description is simple, but I cannot figure out how to make this work in Pandas. Basically, I'm trying to replace consecutive values (except the first) with some replacement value. For example:
data = {
"A": [0, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2, 3]
}
df = pd.DataFrame.from_dict(data)
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 2
10 2
11 2
12 3
If I run this through some function foo(df, 2, 0) I would get the following:
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3
Which replaces all values of 2 with 0, except for the first one. Is this possible?

You can find all the rows where A = 2 and A is also equal to the previous A value and set them to 0:
data = {
"A": [0, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2, 3]
}
df = pd.DataFrame.from_dict(data)
df[(df.A == 2) & (df.A == df.A.shift(1))] = 0
Output:
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3
If you have more than one column in the dataframe, use df.loc to just set the A values:
df.loc[(df.A == 2) & (df.A == df.A.shift(1)), 'A'] = 0
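If you want this wrapped up as the foo(df, 2, 0) from the question, a minimal sketch could look like the following (the function and parameter names are just illustrative, and it replaces a single value per call):

def foo(df, val=2, repl=0):
    # Replace every `val` that directly follows another `val`,
    # i.e. all but the first element of each consecutive run.
    out = df.copy()
    out.loc[(out.A == val) & (out.A == out.A.shift(1)), 'A'] = repl
    return out

foo(df, 2, 0)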

Try this; it works as long as values of 'A' are not duplicated further down the dataframe, e.g. when 'A' is monotonically increasing:
def foo(df, val=2, repl=0):
    return df.mask((df.groupby('A').transform('cumcount') > 0) & (df['A'] == val), repl)
foo(df, 2, 0)
Output:
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3
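If 'A' can also repeat in separate, non-adjacent runs (so the monotonicity assumption above does not hold), a common alternative is to label consecutive runs with a shift/cumsum trick and count within each run. This is only a sketch of that idea, not part of the original answer:

def foo(df, val=2, repl=0):
    # Give every consecutive run of equal values its own label, then replace
    # all rows of a run of `val` except the first one.
    out = df.copy()
    runs = (out['A'] != out['A'].shift()).cumsum()
    not_first = out.groupby(runs).cumcount() > 0
    out.loc[(out['A'] == val) & not_first, 'A'] = repl
    return out

foo(df, 2, 0)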

I'm not sure if this is the best way, but I came up with this solution; hope it helps:
import pandas as pd

data = {
    "A": [0, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2, 3]
}
df = pd.DataFrame(data)

def replecate(df, number, replacement):
    i = 1
    for column in df.columns:
        for index, value in enumerate(df[column]):
            # Keep the first matching value, replace the ones that follow.
            if i == 1 and value == number:
                i = 0
            elif value == number and i != 1:
                df.loc[index, column] = replacement  # .loc avoids chained assignment
        i = 1  # reset the flag for the next column
    return df

replecate(df, 2, 0)
Output:
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3

I've managed a solution to this problem by shifting the column down by one and checking whether the values align. I've also included a function which can take multiple values to check for (not just 2).
import pandas as pd

data = {
    "A": [0, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2, 3]
}
df = pd.DataFrame(data)

def replace_recurring(df, key, offset=1, values=[2]):
    df['offset'] = df[key].shift(offset)
    df.loc[(df[key] == df['offset']) & (df[key].isin(values)), key] = 0
    df = df.drop(['offset'], axis=1)
    return df

df = replace_recurring(df, 'A', offset=1, values=[2])
Giving the output:
A
0 0
1 1
2 1
3 1
4 0
5 0
6 0
7 0
8 2
9 0
10 0
11 0
12 3
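One caveat: the replacement value inside replace_recurring is hardcoded to 0. A small, hypothetical tweak (the repl parameter name is mine) would pass it in instead:

def replace_recurring(df, key, offset=1, values=[2], repl=0):
    # Same approach as above, but the replacement value is a parameter.
    df['offset'] = df[key].shift(offset)
    df.loc[(df[key] == df['offset']) & (df[key].isin(values)), key] = repl
    df = df.drop(['offset'], axis=1)
    return df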

Related

Compare two pandas DataFrames in the most efficient way

Let's consider two pandas dataframes:
import numpy as np
import pandas as pd
df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])
check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])
We want to do the following:
If df[1] > check_df[1] or df[2] > check_df[1] or df[3] > check_df[1], then we assign 1 to df, and 0 otherwise.
If df[2] > check_df[2] or df[3] > check_df[2] or df[4] > check_df[2], then we assign 1 to df, and 0 otherwise.
We apply the same algorithm until the end of the DataFrame.
My primitive code is the following:
df_copy = df.copy()
for i in range(len(df) - 3):
    moving_df = df.iloc[i:i+3]
    if (moving_df > check_df.iloc[i]).any()[0]:
        df_copy.iloc[i] = 1
    else:
        df_copy.iloc[i] = -1
df_copy
0
0 -1
1 1
2 -1
3 1
4 1
5 -1
6 3
7 6
8 7
Could you please give me some advice on whether there is any way to do this without a loop?
IIUC, this is easily done with a rolling max:
N = 3  # window size: each check_df[i] is compared against df[i], df[i+1], df[i+2]
df['out'] = np.where(df[0].rolling(N, min_periods=1).max().shift(1-N).gt(check_df[0]),
                     1, -1)
output:
0 out
0 1 -1
1 2 1
2 3 -1
3 2 1
4 5 1
5 4 -1
6 3 1
7 6 -1
8 7 -1
to keep the last items as is:
m = df[0].rolling(N).max().shift(1-N)
df['out'] = np.where(m.gt(check_df[0]), 1, -1)
df['out'] = df['out'].mask(m.isna(), df[0])
output:
0 out
0 1 -1
1 2 1
2 3 -1
3 2 1
4 5 1
5 4 -1
6 3 1
7 6 6
8 7 7
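In case the shift(1-N) is unclear: rolling(N).max() looks backward, so shifting it by 1-N realigns each window maximum with the first row of its window, turning it into the maximum of the current value and the next N-1 values. A small sketch (my own illustration, reusing df, check_df and N = 3 from above) to inspect the intermediate series:

import numpy as np
import pandas as pd

df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])
check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])
N = 3

# Forward-looking max over the current value and the next N-1 values.
fwd_max = df[0].rolling(N, min_periods=1).max().shift(1 - N)
print(pd.concat({'value': df[0], 'fwd_max': fwd_max, 'check': check_df[0]}, axis=1))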
Although @mozway has already provided a very smart solution, I would like to share my approach as well, which was inspired by this post.
You could create your own object that compares a series with a rolling series. The comparison could be performed by typical operators, i.e. >, < or ==. If at least one comparison holds, the object would return a pre-defined value (given in list returns_tf, where the first element would be returned if the comparison is true, and the second if it's false).
Possible Code:
import numpy as np
import pandas as pd

df = pd.DataFrame([1, 2, 3, 2, 5, 4, 3, 6, 7])
check_df = pd.DataFrame([3, 2, 5, 4, 3, 6, 4, 2, 1])

class RollingComparison:
    def __init__(self, comparing_series: pd.Series, rolling_series: pd.Series, window: int):
        self.comparing_series = comparing_series.values[:-1*window]
        self.rolling_series = rolling_series.values
        self.window = window

    def rolling_window_mask(self, option: str = "smaller"):
        shape = self.rolling_series.shape[:-1] + (self.rolling_series.shape[-1] - self.window + 1, self.window)
        strides = self.rolling_series.strides + (self.rolling_series.strides[-1],)
        rolling_window = np.lib.stride_tricks.as_strided(self.rolling_series, shape=shape, strides=strides)[:-1]
        rolling_window_mask = (
            self.comparing_series.reshape(-1, 1) < rolling_window if option == "smaller" else (
                self.comparing_series.reshape(-1, 1) > rolling_window if option == "greater" else self.comparing_series.reshape(-1, 1) == rolling_window
            )
        )
        return rolling_window_mask.any(axis=1)

    def assign(self, option: str = "rolling", returns_tf: list = [1, -1]):
        mask = self.rolling_window_mask(option)
        return np.concatenate((np.where(mask, returns_tf[0], returns_tf[1]), self.rolling_series[-1*self.window:]))
The assignments can be achieved as follows:
roller = RollingComparison(check_df[0], df[0], 3)
check_df["rolling_smaller_checking"] = roller.assign(option="smaller")
check_df["rolling_greater_checking"] = roller.assign(option="greater")
check_df["rolling_equals_checking"] = roller.assign(option="equal")
Output (the column rolling_smaller_checking equals your desired output):
0 rolling_smaller_checking rolling_greater_checking rolling_equals_checking
0 3 -1 1 1
1 2 1 -1 1
2 5 -1 1 1
3 4 1 1 1
4 3 1 -1 1
5 6 -1 1 1
6 4 3 3 3
7 2 6 6 6
8 1 7 7 7

Count only first occurrence of each sequence python

I have some acceleration data for which I have set up a new column that gives a 1 if the value in the accelpos column is >= 2.5, using the following code:
frame["new3"] = np.where((frame.accelpos >=2.5), '1', '0')
I end up getting data in sequences like so
0,0,0,0,1,1,1,1,1,0,0,0,1,1,0,0,0,1,1,1,1,1,1,1,1,1,1,0,0,0,0
I want to add a second column to give a 1 just at the start of each sequence as follows
0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
Any help would be much appreciated.
You can compare each value with the previous one using Series.shift and keep only the values equal to '1', chaining both conditions with & (bitwise AND) and finally casting the booleans to integers for the True/False to 1/0 mapping:
df = pd.DataFrame({'col':'0,0,0,0,1,1,1,1,1,0,0,0,1,1,0,0,0,1,1,1,1,1,1,1,1,1,1,0,0,0,0'.split(',')})
df['new'] = (df['col'].ne(df['col'].shift()) & df['col'].eq('1')).astype(int)
Or test the difference; because the first value could itself be a 1, replace the missing value produced by diff with the original using fillna:
s = df['col'].astype(int)
df['new'] = s.diff().fillna(s).eq(1).astype(int)
print (df)
col new
0 0 0
1 0 0
2 0 0
3 0 0
4 1 1
5 1 0
6 1 0
7 1 0
8 1 0
9 0 0
10 0 0
11 0 0
12 1 1
13 1 0
14 0 0
15 0 0
16 0 0
17 1 1
18 1 0
19 1 0
20 1 0
21 1 0
22 1 0
23 1 0
24 1 0
25 1 0
26 1 0
27 0 0
28 0 0
29 0 0
30 0 0
I am not familiar with the where function. I guess I might try and help from an algorithmic point of view.
Assume we have a list a = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, ..., 0]
From an algorithmic POV, if you want to replace each sequence of 1s with a single one at the beginning of the sequence, here is what you want to do:
parse the list
assess whether it is a one or a zero
if it is a one, then each following item must be set to 0 until you actually hit a zero
You might want to have something like this:
a = [0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1]
for i in range(len(a)-1):
    if a[i] == 1:
        for j in range(1, len(a)-i):
            if a[i+j] == 1:
                a[i+j] = 0
            else:
                break

how to print ones and zeros in columns with their indexes in python?

I have a list of zeros and ones, and I want to print them in two different columns with headings and index numbers, something like this:
list = [1,0,1,1,1,0,1,0,1,0,0]
ones zeros
1 1 2 0
3 1 6 0
4 1 8 0
5 1 10 0
7 1 11 0
9 1
This is the desired output.
I tried this:
list = [1,0,1,1,1,0,1,0,1,0,0]
print('ones',end='\t')
print('zeros')
for index,ele in enumerate(list,start=1):
    if ele==1:
        print(index,ele,end=" ")
    elif ele==0:
        print(" ")
        print(index,ele,end=" ")
    else:
        print()
But this gives output like this:
ones zeros
1 1
2 0 3 1 4 1 5 1
6 0 7 1
8 0 9 1
10 0
11 0
How do I get the desired output?
Any help is appreciated.
You can use itertools.zip_longest, str.ljust, f-strings (for formatting), and some calculations for the printing part, and use two lists to hold the indices of both zeros and ones:
l = [1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0]

ones, zeros = [], []
max_len_zeros = max_len_ones = 0
for index, num in enumerate(l, 1):
    if num == 0:
        zeros.append(index)
        max_len_zeros = max(max_len_zeros, len(str(index)))
    else:
        ones.append(index)
        max_len_ones = max(max_len_ones, len(str(index)))

from itertools import zip_longest

print('ones' + ' ' * (max_len_ones + 2) + 'zeros')
for ones_index, zeros_index in zip_longest(ones, zeros, fillvalue=''):
    one = '1' if ones_index else ' '
    this_one_index = str(ones_index).ljust(max_len_ones)
    zero = '0' if zeros_index else ''
    this_zero_index = str(zeros_index).ljust(max_len_zeros)
    print(f'{this_one_index} {one} {this_zero_index} {zero}')
Output:
ones zeros
1 1 2 0
3 1 6 0
4 1 8 0
5 1 10 0
7 1 11 0
9 1
List with more zeros than ones:
In: l = [1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0]
Out:
ones zeros
1 1 2 0
4 1 3 0
7 1 5 0
9 1 6 0
10 1 8 0
14 1 11 0
12 0
13 0
15 0
List with equal number of zeros and ones:
In: l = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1]
Out:
ones zeros
1 1 2 0
3 1 4 0
5 1 6 0
8 1 7 0
9 1 10 0
11 1 13 0
12 1 14 0
15 1 16 0
18 1 17 0
20 1 19 0
It's hard to do what you need in an iterative way. I have kind of a "broken" solution that both shows how you could better do what you are trying to do and why an iterative approach is limited in this case.
I updated your code as follows:
list = [1,0,1,1,1,0,1,0,1,0,0]
print('ones',end='\t')
print('zeros')
for index,ele in enumerate(list,start=1):
    # First check if extra space OR new lines OR both are needed
    if index > 1:
        if ele==1:
            print()
        elif ele==0:
            if list[index-2]==1:
                print('', end=' \t')
            else:
                print('', end='\n\t\t')
    # THEN, write your desired output without any end
    if ele==1:
        print(index,ele,end="")
    elif ele==0:
        print(index,ele,end="")

# Finally an empty line
print()
It gives the following output:
ones zeros
1 1 2 0
3 1
4 1
5 1 6 0
7 1 8 0
9 1 10 0
11 0
As you can see, its limitation is that you can't go "up" and rewrite previous lines.
However, if you need to display it EXACTLY as you've shown, you need to construct an intermediate data structure first and then display it using zip, as in the sketch below.
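For instance, a minimal sketch of that idea (using two index lists and itertools.zip_longest rather than a dict, since the two columns can have different lengths; the tab-based padding is a choice of mine):

from itertools import zip_longest

l = [1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0]

# Build the intermediate structure first: the indices of ones and of zeros.
ones = [i for i, v in enumerate(l, start=1) if v == 1]
zeros = [i for i, v in enumerate(l, start=1) if v == 0]

print('ones\tzeros')
for one_idx, zero_idx in zip_longest(ones, zeros):
    left = f'{one_idx} 1' if one_idx is not None else ''
    right = f'{zero_idx} 0' if zero_idx is not None else ''
    print(f'{left}\t{right}')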

Pandas Mapping Numbers to another Number

I have ~5000 rows and all values in my 'Round' column go from -1 to 7. I'm trying to create a new column mapped so that -1 becomes 0 and anything from 1-7 becomes 1. I tried a simple map and listed all the mappings, but this doesn't work.
combine['Drafted'] = combine.Round.map({'-1':0,'1':1,'2':1,'3':1,'4':1,'5':1,'6':1,'7':1})
Is there something wrong with the logic above that it wouldn't work?
I guess you can achieve it using the code below:
df = pd.DataFrame({'Round': [-1, 1, 0, 7, -1, 2, 3, 5, -1, 4, 6]})
df['Drafted'] = np.where(df['Round'] == -1, 0, 1)
print(df)
And the output is as below:
Round Drafted
0 -1 0
1 1 1
2 0 1
3 7 1
4 -1 0
5 2 1
6 3 1
7 5 1
8 -1 0
9 4 1
10 6 1
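As for why the original map didn't work: assuming Round holds integers, the dictionary keys in the question are strings ('-1', '1', ...), so every lookup misses and the result is all NaN. With integer keys the original approach should work (column names taken from the question; if 0 can occur in Round, add it to the dict or use the comparison form, which, like np.where above, maps everything except -1 to 1):

combine['Drafted'] = combine['Round'].map({-1: 0, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1})

# Or, more compactly:
combine['Drafted'] = (combine['Round'] != -1).astype(int)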

Creating intervaled ramp array based on a threshold - Python / NumPy

I would like to measure the length of a sub-array fulfilling some condition (like a stop clock), but as soon as the condition is no longer fulfilled, the value should reset to zero. So, the resulting array should tell me how many consecutive values have fulfilled some condition (e.g. value > 1):
[0, 0, 2, 2, 2, 2, 0, 3, 3, 0]
should result in the following array:
[0, 0, 1, 2, 3, 4, 0, 1, 2, 0]
One can easily define a function in Python which returns the corresponding numpy array:
import numpy as np

def StopClock(signal, threshold=1):
    clock = []
    current_time = 0
    for item in signal:
        if item > threshold:
            current_time += 1
        else:
            current_time = 0
        clock.append(current_time)
    return np.array(clock)

StopClock([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
However, I really do not like this for-loop, especially since this counter should run over a longer dataset. I thought of some np.cumsum solution in combination with np.diff, however I do not get through the reset part. Is someone aware of a more elegant numpy-style solution of above problem?
This solution uses pandas to perform a groupby:
s = pd.Series([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
threshold = 0
>>> np.where(
        s > threshold,
        s
        .to_frame()  # Convert series to dataframe.
        .assign(_dummy_=1)  # Add column of ones.
        .groupby((s.gt(threshold) != s.gt(threshold).shift()).cumsum())['_dummy_']  # shift-cumsum pattern
        .transform(lambda x: x.cumsum()),  # Cumsum the ones per group.
        0)  # Fill value with zero where threshold not exceeded.
array([0, 0, 1, 2, 3, 4, 0, 1, 2, 0])
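The key trick there is the shift-cumsum group id: comparing the boolean mask with its shifted self marks every position where a run starts or stops, and the cumulative sum of those changes gives each consecutive run its own label. A small sketch (my own illustration) of just that part:

import pandas as pd

s = pd.Series([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
above = s.gt(0)

# The label increments every time the mask flips, so each run gets its own id.
group_id = (above != above.shift()).cumsum()
print(pd.concat({'s': s, 'above': above, 'group_id': group_id}, axis=1))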
Yes, we can use diff-style differentiation along with cumsum to create such intervaled ramps in a vectorized manner, and that should be pretty efficient, especially with large input arrays. The resetting is handled by assigning appropriate (negative) offsets at the end of each interval, so that the cumulative sum resets to zero there.
Here's one implementation to accomplish all that -
def intervaled_ramp(a, thresh=1):
    mask = a > thresh

    # Get start, stop indices
    mask_ext = np.concatenate(([False], mask, [False]))
    idx = np.flatnonzero(mask_ext[1:] != mask_ext[:-1])
    s0, s1 = idx[::2], idx[1::2]

    out = mask.astype(int)
    valid_stop = s1[s1 < len(a)]
    out[valid_stop] = s0[:len(valid_stop)] - valid_stop
    return out.cumsum()
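To make the reset step concrete, here is a short trace (my own annotation) for the sample input from the question:

import numpy as np

a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])

# mask               : [F F T T T T  F T T  F]
# run starts s0      : [2, 7], run stops s1: [6, 9]
# out before offsets : [0 0 1 1 1 1  0 1 1  0]
# out with offsets   : [0 0 1 1 1 1 -4 1 1 -2]   (each stop gets minus the run length)
# out.cumsum()       : [0 0 1 2 3 4  0 1 2  0]
print(intervaled_ramp(a, thresh=1))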
Sample runs -
Input (a) :
[5 3 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 1]
Output (intervaled_ramp(a, thresh=1)) :
[1 2 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 0]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 1]
Output (intervaled_ramp(a, thresh=1)) :
[0 0 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 0]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 5]
Output (intervaled_ramp(a, thresh=1)) :
[0 0 0 1 2 0 0 1 2 3 4 0 1 2 0 0 0 1 0 1 2 3 4 0 1]
Input (a) :
[1 1 1 4 5 0 0 2 2 2 2 0 3 3 0 1 1 2 0 3 5 4 3 0 5]
Output (intervaled_ramp(a, thresh=0)) :
[1 2 3 4 5 0 0 1 2 3 4 0 1 2 0 1 2 3 0 1 2 3 4 0 1]
Runtime test
One way to do a fair benchmark is to take the posted sample from the question, tile it a large number of times, and use that as the input array. With that setup, here are the timings -
In [841]: a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])
In [842]: a = np.tile(a,10000)
# @Alexander's soln
In [843]: %timeit pandas_app(a, threshold=1)
1 loop, best of 3: 3.93 s per loop
# @Psidom's soln
In [844]: %timeit stop_clock(a, threshold=1)
10 loops, best of 3: 119 ms per loop
# Proposed in this post
In [845]: %timeit intervaled_ramp(a, thresh=1)
1000 loops, best of 3: 527 µs per loop
Another numpy solution:
import numpy as np

a = np.array([0, 0, 2, 2, 2, 2, 0, 3, 3, 0])

def stop_clock(signal, threshold=1):
    mask = signal > threshold
    # Boundaries between consecutive runs of the boolean mask.
    indices = np.flatnonzero(np.diff(mask)) + 1
    # Split the mask at those boundaries and cumsum within each piece,
    # so the counter restarts at every run.
    return np.concatenate(list(map(np.cumsum, np.array_split(mask, indices))))

stop_clock(a)
# array([0, 0, 1, 2, 3, 4, 0, 1, 2, 0])
