How to remove consecutive pairs of opposite numbers from Pandas Dataframe? - python

How can I remove consecutive pairs of equal numbers with opposite signs from a Pandas DataFrame?
Assuming I have this input DataFrame:
import pandas as pd
incremental_changes = [2, -2, 2, 1, 4, 5, -5, 7, -6, 6]
df = pd.DataFrame({
    'idx': range(len(incremental_changes)),
    'incremental_changes': incremental_changes
})
idx incremental_changes
0 0 2
1 1 -2
2 2 2
3 3 1
4 4 4
5 5 5
6 6 -5
7 7 7
8 8 -6
9 9 6
I would like to get the following
idx incremental_changes
0 0 2
3 3 1
4 4 4
7 7 7
Note that the first 2 could either be idx 0 or 2, it doesn't really matter.
Thanks

You can group consecutive numbers that have the same absolute value and use transform to flag the cancelling pairs:
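import itertools
def remove_duplicates(s):
    ''' Generates booleans that indicate when a pair of ints with
    opposite signs are found.
    '''
    iter_ = iter(s)
    # Zipping the iterator with itself yields non-overlapping pairs (a, b)
    for (a, b) in itertools.zip_longest(iter_, iter_):
        if b is None:
            # Unpaired element left at the end of the group: keep it
            yield False
        else:
            # One flag for each element of the pair
            yield a+b == 0
            yield a+b == 0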
>>> mask = df.groupby(df['incremental_changes'].abs().diff().ne(0).cumsum()) \
['incremental_changes'] \
.transform(remove_duplicates)
Then
>>> df[~mask]
idx incremental_changes
2 2 2
3 3 1
4 4 4
7 7 7

Use a rolling sum over a window of 2 to find adjacent values that cancel, then filter out overlapping matches:
s = df.incremental_changes.rolling(2).sum()
s = s.mask(s[s==0].groupby(s.ne(0).cumsum()).cumcount()==1)==0
df[~(s | s.shift(-1))]
Out[640]:
idx incremental_changes
2 2 2
3 3 1
4 4 4
7 7 7
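For comparison, the same pairing can be written as a plain left-to-right scan, which may be easier to reason about than the masks above. This is only a sketch: the helper name opposite_pair_mask is made up for illustration, and it assumes a pair is consumed as soon as it is found, so the surviving 2 is the one at idx 2.

```python
import pandas as pd

incremental_changes = [2, -2, 2, 1, 4, 5, -5, 7, -6, 6]
df = pd.DataFrame({
    'idx': range(len(incremental_changes)),
    'incremental_changes': incremental_changes,
})

def opposite_pair_mask(values):
    """Hypothetical helper: True for rows to keep, False for cancelled pairs."""
    keep = [True] * len(values)
    i = 0
    while i < len(values) - 1:
        if values[i] + values[i + 1] == 0:
            # Adjacent values cancel: drop both and skip past the pair.
            # Note that two adjacent zeros also count as a pair here.
            keep[i] = keep[i + 1] = False
            i += 2
        else:
            i += 1
    return keep

keep = opposite_pair_mask(df['incremental_changes'].tolist())
result = df[keep]
print(result)
```

The loop is O(n) and never matches a value twice, at the cost of not being vectorized.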

Related

np.array index slicing between conditions

I need to slice an array's index from where a first condition is true to where a second condition is true. These conditions are never true at the same time, but one can be true more than once before the other occurs.
Let me try to explain:
array_filter = np.array([3,4,5,6,4,3,2,3,4,5])
array1 = np.array([2,3,4,6,3,3,1,2,3,4])
array2 = np.array([3,5,6,7,5,4,3,3,5,6])
array1_cond = array1 >= array_filter
array2_cond = array2 <= array_filter
                 0   1   2   3   4   5   6   7   8   9
array_filter     3   4   5   6   4   3   2   3   4   5
array1           2   3   4   6   3   3   1   2   3   4
array1_cond                  ^       ^                    (^ = True)
array2           3   5   6   7   5   4   3   3   5   6
array2_cond      ^                           ^
expected_output  2   3   4 | 7   5   4   3 | 2   3   4
                  array1   |    array2     |   array1
EXPECTED OUTPUT:
expected_output[(array2_cond) : (array1_cond)] = array1[(array2_cond) : (array1_cond)]
expected_output[(array1_cond) : (array2_cond)] = array2[(array1_cond) : (array2_cond)]
expected_output = [ 2, 3, 4, 7, 5, 4, 3, 2, 3, 4 ]
I'm so sorry if syntax is a little confusing, but idk how to make it better... <3
How can I perform this?
Is it possible WITHOUT LOOPS?
This works for your example, with a, b in place of array1, array2:
nz = np.flatnonzero(a_cond | b_cond)
lengths = np.diff(nz, append=len(a))
cond = np.repeat(b_cond[nz], lengths)
result = np.where(cond, a, b)
If at the start of the arrays neither condition holds true then elements from b are selected.
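One way to make that leading case explicit is to pad the run starts with index 0, so positions before the first true condition fall back to b. Without the padding, cond would come out shorter than the inputs whenever index 0 satisfies neither condition. This is a sketch rather than the answerer's code: the names starts and use_a are introduced here for illustration.

```python
import numpy as np

array_filter = np.array([3, 4, 5, 6, 4, 3, 2, 3, 4, 5])
a = np.array([2, 3, 4, 6, 3, 3, 1, 2, 3, 4])   # array1 in the question
b = np.array([3, 5, 6, 7, 5, 4, 3, 3, 5, 6])   # array2 in the question
a_cond = a >= array_filter
b_cond = b <= array_filter

# Positions where either condition fires; each one starts a new run
nz = np.flatnonzero(a_cond | b_cond)
# Prepend index 0 so a leading run with no condition is covered
# (its length is 0 when a condition already fires at index 0)
starts = np.concatenate(([0], nz))
lengths = np.diff(starts, append=len(a))
# Leading run selects b (False); after b_cond select a, after a_cond select b
use_a = np.concatenate(([False], b_cond[nz]))
cond = np.repeat(use_a, lengths)
result = np.where(cond, a, b)
print(result)
```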

Rotate a list clockwise once

How can I rotate a list clockwise once? I have a temporary solution, but I'm sure there is a better way to do it.
I want to get from this
Index: 0 1 2 3 4 5 6 7 8 9
Count: 0 2 4 4 5 6 6 7 7 7
to this:
Index: 0 1 2 3 4 5 6 7 8 9
Count: 0 0 2 4 4 5 6 6 7 7
And my temporary "solution" is just:
temporary = [0, 2, 4, 4, 5, 6, 6, 7, 7, 7]
test = [None] * len(temporary)
test[0] = temporary[0]
for index in range(1, len(temporary)):
    test[index] = temporary[index - 1]
You might use temporary.pop() to discard the last item and temporary.insert(0, 0) to add 0 to the front.
Alternatively in one line:
temporary = [0] + temporary[:-1]
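The one-liner hard-codes the leading 0. A small variation, sketched here under the assumption that the first element should simply be repeated (as in the loop from the question), uses a slice instead. For contrast, a true rotation that wraps the last element around can be done with collections.deque; the names shifted and rotated are illustrative.

```python
from collections import deque

temporary = [0, 2, 4, 4, 5, 6, 6, 7, 7, 7]

# Shift right by one, repeating the first element instead of hard-coding 0
shifted = temporary[:1] + temporary[:-1]

# A true rotation, by contrast, wraps the last element around to the front
rotated = deque(temporary)
rotated.rotate(1)
rotated = list(rotated)

print(shifted)
print(rotated)
```

The slice version also works on an empty list, where indexing temporary[0] would raise an IndexError.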

Drop rows if value in column changes

Assume I have the following pandas data frame:
my_class value
0 1 1
1 1 2
2 1 3
3 2 4
4 2 5
5 2 6
6 2 7
7 2 8
8 2 9
9 3 10
10 3 11
11 3 12
I want to identify the indices of "my_class" where the class changes and remove n rows after and before this index. The output of this example (with n=2) should look like:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12
My approach:
# where class changes happen
s = df['my_class'].ne(df['my_class'].shift(-1).fillna(df['my_class']))
# mask with `bfill` and `ffill`
df[~(s.where(s).bfill(limit=1).ffill(limit=2).eq(1))]
Output:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12
One possible solution is to:
Make use of the fact that the index contains consecutive integers.
Find the index values where the class changes.
For each such index i, generate the sequence of indices from i-2 to i+1 and concatenate them.
Retrieve the rows whose indices are not in this list.
The code to do it is:
ind = df[df['my_class'].diff().fillna(0) == 1].index
df[~df.index.isin([item for sublist in
    [range(i-2, i+2) for i in ind] for item in sublist])]
A self-contained variant of the same idea:
import numpy as np
import pandas as pd
my_class = np.array([1] * 3 + [2] * 6 + [3] * 3)
cols = np.c_[my_class, np.arange(len(my_class)) + 1]
df = pd.DataFrame(cols, columns=['my_class', 'value'])
df['diff'] = df['my_class'].diff().fillna(0)
# collect the two rows before each class change and the two rows starting at it
idx2drop = []
for i in df[df['diff'] == 1].index:
    idx2drop += range(i - 2, i + 2)
print(df.drop(idx2drop)[['my_class', 'value']])
Output:
my_class value
0 1 1
5 2 6
6 2 7
11 3 12
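The answers above fix n=2 and rely on the class increasing by exactly 1. A more general sketch is shown below; the function name drop_around_changes and its parameters are hypothetical, it works on row positions rather than index labels, and it treats any change in value as a class change.

```python
import numpy as np
import pandas as pd

def drop_around_changes(df, col, n):
    """Hypothetical helper: drop n rows before and n rows from each change in `col`."""
    values = df[col].to_numpy()
    # Positions whose value differs from the previous row
    change_pos = np.flatnonzero(values[1:] != values[:-1]) + 1
    # Offsets -n .. n-1 around every change position
    offsets = np.arange(-n, n)
    drop_pos = (change_pos[:, None] + offsets).ravel()
    # Ignore positions that fall outside the frame
    drop_pos = drop_pos[(drop_pos >= 0) & (drop_pos < len(df))]
    keep = np.ones(len(df), dtype=bool)
    keep[drop_pos] = False
    return df[keep]

df = pd.DataFrame({
    'my_class': [1] * 3 + [2] * 6 + [3] * 3,
    'value': range(1, 13),
})
result = drop_around_changes(df, 'my_class', n=2)
print(result)
```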

Rolling sum on a dynamic window

I am new to python and the last time I coded was in the mid-80's so I appreciate your patient help.
It seems .rolling(window) requires the window to be a fixed integer. I need a rolling window where the window or lookback period is dynamic and given by another column.
In the table below, I seek the Lookbacksum which is the rolling sum of Data as specified by the Lookback column.
d={'Data':[1,1,1,2,3,2,3,2,1,2],
'Lookback':[0,1,2,2,1,3,3,2,3,1],
'LookbackSum':[1,2,3,4,5,8,10,7,8,3]}
df=pd.DataFrame(data=d)
eg:
Data Lookback LookbackSum
0 1 0 1
1 1 1 2
2 1 2 3
3 2 2 4
4 3 1 5
5 2 3 8
6 3 3 10
7 2 2 7
8 1 3 8
9 2 1 3
You can create a custom function for use with df.apply, eg:
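def lookback_window(row, values, lookback, method='sum', *args, **kwargs):
    # positional location of the current row within `values`
    loc = values.index.get_loc(row.name)
    # number of previous rows to include in the window
    lb = lookback.loc[row.name]
    # apply the requested aggregation (e.g. 'sum') to the window
    return getattr(values.iloc[loc - lb: loc + 1], method)(*args, **kwargs)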
Then use it as:
df['new_col'] = df.apply(lookback_window, values=df['Data'], lookback=df['Lookback'], axis=1)
There may be some corner cases but as long as your indices align and are unique - it should fulfil what you're trying to do.
Here is one with a list comprehension. It enumerates df['Lookback'] to get each row position and lookback value, takes the data up to that row, reverses it, and sums the first lookback + 1 values:
df['LookbackSum'] = [sum(df.loc[:e,'Data'][::-1].to_numpy()[:i+1])
                     for e,i in enumerate(df['Lookback'])]
print(df)
Data Lookback LookbackSum
0 1 0 1
1 1 1 2
2 1 2 3
3 2 2 4
4 3 1 5
5 2 3 8
6 3 3 10
7 2 2 7
8 1 3 8
9 2 1 3
An exercise in pain, if you want to try an almost fully vectorized approach. Sidenote: I don't think it's worth it here. At all.
Inspired by Divakar's answer here
Given:
import numpy as np
import pandas as pd
d={'Data':[1,1,1,2,3,2,3,2,1,2],
'Lookback':[0,1,2,2,1,3,3,2,3,1],
'LookbackSum':[1,2,3,4,5,8,10,7,8,3]}
df=pd.DataFrame(data=d)
Using the function from Divakar's answer, but slightly modified
from skimage.util.shape import view_as_windows as viewW
def strided_indexing_roll(a, r, fill_value=np.nan):
    # Concatenate with sliced to cover all rolls
    p = np.full((a.shape[0],a.shape[1]-1),fill_value)
    a_ext = np.concatenate((p,a,p),axis=1)
    # Get sliding windows; use advanced-indexing to select appropriate ones
    n = a.shape[1]
    return viewW(a_ext,(1,n))[np.arange(len(r)), -r + (n-1),0]
Now, we just need to prepare a 2d array for the data and independently shift the rows according to our desired lookback values.
arr = df['Data'].to_numpy().reshape(1, -1).repeat(len(df), axis=0)
shifter = np.arange(len(df) - 1, -1, -1) #+ d['Lookback'] - 1
temp = strided_indexing_roll(arr, shifter, fill_value=0)
out = strided_indexing_roll(temp, (len(df) - 1 - df['Lookback'])*-1, 0).sum(-1)
Output:
array([ 1, 2, 3, 4, 5, 8, 10, 7, 8, 3], dtype=int64)
We can then just assign it back to the dataframe as needed and check.
df['out'] = out
#output:
Data Lookback LookbackSum out
0 1 0 1 1
1 1 1 2 2
2 1 2 3 3
3 2 2 4 4
4 3 1 5 5
5 2 3 8 8
6 3 3 10 10
7 2 2 7 7
8 1 3 8 8
9 2 1 3 3
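Because the aggregation here is a sum, prefix sums offer another vectorized route that needs only NumPy: each window sum is the difference of two cumulative sums. This is a sketch, not one of the original answers. The names csum, start and end are illustrative, and clipping start at 0 (so a lookback longer than the available history is truncated) is an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Data': [1, 1, 1, 2, 3, 2, 3, 2, 1, 2],
    'Lookback': [0, 1, 2, 2, 1, 3, 3, 2, 3, 1],
})

data = df['Data'].to_numpy()
lookback = df['Lookback'].to_numpy()

# csum[k] holds the sum of data[:k], so a window sum is csum[end] - csum[start]
csum = np.concatenate(([0], np.cumsum(data)))
end = np.arange(len(df)) + 1
# Assumption: windows reaching before the first row are truncated at row 0
start = np.clip(end - 1 - lookback, 0, None)
df['LookbackSum'] = csum[end] - csum[start]
print(df)
```

This only covers sums (and anything derived from them, such as means); other aggregations would still need one of the approaches above.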

Best way to split a DataFrame given an edge

Suppose I have the following DataFrame:
a b
0 A 1.516733
1 A 0.035646
2 A -0.942834
3 B -0.157334
4 A 2.226809
5 A 0.768516
6 B -0.015162
7 A 0.710356
8 A 0.151429
And I need to group it given the "edge B"; that means the groups will be:
a b
0 A 1.516733
1 A 0.035646
2 A -0.942834
3 B -0.157334

4 A 2.226809
5 A 0.768516
6 B -0.015162

7 A 0.710356
8 A 0.151429
That is any time I find a 'B' in the column 'a' I want to split my DataFrame.
My current solution is:
#create the dataframe
s = pd.Series(['A','A','A','B','A','A','B','A','A'])
ss = pd.Series(np.random.randn(9))
dff = pd.DataFrame({"a":s,"b":ss})
#my solution
count = 0
ls = []
for i in s:
    if i=="A":
        ls.append(count)
    else:
        ls.append(count)
        count+=1
dff['grpb']=ls
and I got the dataframe:
a b grpb
0 A 1.516733 0
1 A 0.035646 0
2 A -0.942834 0
3 B -0.157334 0
4 A 2.226809 1
5 A 0.768516 1
6 B -0.015162 1
7 A 0.710356 2
8 A 0.151429 2
Which I can then split with dff.groupby('grpb').
Is there a more efficient way to do this using pandas' functions?
Here's a one-liner (written for Python 2 and an older pandas; pd.rolling_median has since been removed from pandas):
zip(*dff.groupby(pd.rolling_median((1*(dff['a']=='B')).cumsum(),3,True)))[-1]
[ 1 2
0 A 1.516733
1 A 0.035646
2 A -0.942834
3 B -0.157334,
1 2
4 A 2.226809
5 A 0.768516
6 B -0.015162,
1 2
7 A 0.710356
8 A 0.151429]
How about:
df.groupby((df.a == "B").shift(1).fillna(0).cumsum())
For example:
>>> df
a b
0 A -1.957118
1 A -0.906079
2 A -0.496355
3 B 0.552072
4 A -1.903361
5 A 1.436268
6 B 0.391087
7 A -0.907679
8 A 1.672897
>>> gg = list(df.groupby((df.a == "B").shift(1).fillna(0).cumsum()))
>>> pprint.pprint(gg)
[(0,
a b
0 A -1.957118
1 A -0.906079
2 A -0.496355
3 B 0.552072),
(1, a b
4 A -1.903361
5 A 1.436268
6 B 0.391087),
(2, a b
7 A -0.907679
8 A 1.672897)]
(I didn't bother getting rid of the indices; you could use [g for k, g in df.groupby(...)] if you liked.)
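On recent pandas versions the same idea can be sketched with shift(fill_value=False), which keeps the mask boolean and avoids the fillna step. The variable names is_edge and group_id are illustrative, and column b is filled with placeholder integers instead of random numbers.

```python
import pandas as pd

df = pd.DataFrame({'a': list('AAABAABAA'), 'b': range(9)})

# True on every edge row
is_edge = df['a'].eq('B')
# Shift by one row so each 'B' closes its group instead of opening the next
group_id = is_edge.shift(fill_value=False).cumsum()
groups = [g for _, g in df.groupby(group_id)]

for g in groups:
    print(g)
```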
An alternative is:
In [36]: dff
Out[36]:
a b
0 A 0.689785
1 A -0.374623
2 A 0.517337
3 B 1.549259
4 A 0.576892
5 A -0.833309
6 B -0.209827
7 A -0.150917
8 A -1.296696
In [37]: dff['grpb'] = np.nan
In [38]: breaks = dff[dff.a == 'B'].index
In [39]: dff.loc[breaks, 'grpb'] = list(range(len(breaks)))
In [40]: dff.bfill().fillna(len(breaks))
Out[40]:
a b grpb
0 A 0.689785 0
1 A -0.374623 0
2 A 0.517337 0
3 B 1.549259 0
4 A 0.576892 1
5 A -0.833309 1
6 B -0.209827 1
7 A -0.150917 2
8 A -1.296696 2
Or using itertools to create 'grpb' is an option too.
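A minimal sketch of what an itertools version might look like, assuming a running count of the edges seen so far is all that is needed; itertools.accumulate does the counting, and the names is_edge and grpb_values are illustrative.

```python
from itertools import accumulate

import pandas as pd

s = pd.Series(['A', 'A', 'A', 'B', 'A', 'A', 'B', 'A', 'A'])

# True wherever the edge value occurs
is_edge = [value == 'B' for value in s]
# Prepend 0 and drop the last flag so each 'B' stays in the group it closes;
# accumulate then yields the running number of edges seen before each row
grpb_values = list(accumulate([0] + is_edge[:-1]))

dff = pd.DataFrame({'a': s, 'grpb': grpb_values})
print(dff)
```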
def vGroup(dataFrame, edgeCondition, groupName='autoGroup'):
    groupNum = 0
    dataFrame[groupName] = ''
    #loop over each row
    for inx, row in dataFrame.iterrows():
        if edgeCondition[inx]:
            dataFrame.loc[inx, groupName] = 'edge'
            groupNum += 1
        else:
            dataFrame.loc[inx, groupName] = groupNum
    return dataFrame[groupName]
vGroup(df, df[0] == ' ')
