TL;DR version:
I have a column like below,
[2, 2, 0, 0, 0, 2, 2, 0, 3, 3, 3, 0, 0, 2, 2, 0, 0, 0, 0, 2, 2, 0, 0, 0, 3, 3, 3]
# The column may also contain longer runs, like 4, 5, 6, 7, 8...
I need a function with parameters n and m. If I use
n=2, m=3,
it should measure the distance from each run of n to the next run of m, and the final result after grouping would be:
[6, 9]
Detailed version
Here is the test case. I'm writing a function that, given n and m, generates a list of distances between consecutive runs. Currently, the function only works with a single parameter N (it measures the distance from one run of N consecutive values to the next run of N). I want to change it so that it also accepts M.
import numpy as np
import pandas as pd

dummy = [1,1,0,0,0,1,1,0,1,1,1,0,0,1,1,0,0,0,0,1,1,0,0,0,1,1,1]
df = pd.DataFrame({'a': dummy})
What I have written currently:
def get_N_seq_stat(df, N=2, M=3):
    df["c1"] = (
        df.groupby(df.a.ne(df.a.shift()).cumsum())["a"]
        .transform("size")
        .where(df.a.eq(1), 0)
    )
    df["c2"] = np.where(df.c1.ne(N), 1, 0)
    df["c3"] = df["c2"].ne(df["c2"].shift()).cumsum()
    result = df.loc[df["c2"] == 1].groupby("c3")["c2"].count().tolist()
    # if the last N rows are not a run of N, the last distance should not be counted
    if not (df["c1"].tail(N) == N).all():
        del result[-1]
    if not (df["c1"].head(N) == N).all():
        del result[0]
    return result
If I set N=2, M=3 (from a run of 2 to a run of 3), the ideal return value would be [6, 9], because of the spans marked below:
dummy = [1,1,**0,0,0,1,1,0,**1,1,1,0,0,1,1,**0,0,0,0,1,1,0,0,0,**1,1,1]
Currently, if I set N=2, the returned list is [3, 6, 4], because the counted spans are:
dummy = [1,1,**0,0,0,**1,1,**0,1,1,1,0,0,**1,1,**0,0,0,0,**1,1,0,0,0,1,1,1]
I would modify your code this way:
def get_N_seq_stat(df, N=2, M=3, debug=False):
    # get number of consecutive 1s
    c1 = (
        df.groupby(df.a.ne(df.a.shift()).cumsum())["a"]
        .transform("size")
        .where(df.a.eq(1), 0)
    )
    # find stretches between N and M
    m1 = c1.eq(N)
    m2 = c1.eq(M)
    c2 = pd.Series(np.select([m1.shift() & ~m1, m2], [True, False], np.nan),
                   index=df.index).ffill().eq(1)
    # debug mode to understand how this works
    if debug:
        return df.assign(c1=c1, c2=c2,
                         length=c2[c2].groupby(c2.ne(c2.shift()).cumsum())
                                      .transform('size'))
    # get the length of the stretches
    return c2[c2].groupby(c2.ne(c2.shift()).cumsum()).size().to_list()
get_N_seq_stat(df, N=2, M=3)
Output: [6, 9]
Intermediate c1, c2, and length:
get_N_seq_stat(df, N=2, M=3, debug=True)
a c1 c2 length
0 1 2 False NaN
1 1 2 False NaN
2 0 0 True 6.0
3 0 0 True 6.0
4 0 0 True 6.0
5 1 2 True 6.0
6 1 2 True 6.0
7 0 0 True 6.0
8 1 3 False NaN
9 1 3 False NaN
10 1 3 False NaN
11 0 0 False NaN
12 0 0 False NaN
13 1 2 False NaN
14 1 2 False NaN
15 0 0 True 9.0
16 0 0 True 9.0
17 0 0 True 9.0
18 0 0 True 9.0
19 1 2 True 9.0
20 1 2 True 9.0
21 0 0 True 9.0
22 0 0 True 9.0
23 0 0 True 9.0
24 1 3 False NaN
25 1 3 False NaN
26 1 3 False NaN
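The heart of this is the np.select + ffill combination, which works like a latch: it writes True where a stretch should start, False where it should stop, and NaN elsewhere, and forward-filling then carries the last state across the gaps. Below is a simplified sketch of that latch on a toy series (toy data; note the real function raises the flag just after a run of N ends, via m1.shift() & ~m1):

import numpy as np
import pandas as pd

s = pd.Series([0, 1, 0, 0, 2, 0])   # toy marker column: 1 starts a stretch, 2 stops it
start = s.eq(1)
stop = s.eq(2)
# True at a start marker, False at a stop marker, NaN elsewhere;
# ffill carries the last flag forward, eq(1) turns it back into booleans
flag = pd.Series(np.select([start, stop], [True, False], np.nan),
                 index=s.index).ffill().eq(1)
print(flag.tolist())   # [False, True, True, True, False, False]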
How can I remove consecutive pairs of equal numbers with opposite signs from a pandas DataFrame?
Assuming I have this input DataFrame:
incremental_changes = [2, -2, 2, 1, 4, 5, -5, 7, -6, 6]
df = pd.DataFrame({
'idx': range(len(incremental_changes)),
'incremental_changes': incremental_changes
})
idx incremental_changes
0 0 2
1 1 -2
2 2 2
3 3 1
4 4 4
5 5 5
6 6 -5
7 7 7
8 8 -6
9 9 6
I would like to get the following
idx incremental_changes
0 0 2
3 3 1
4 4 4
7 7 7
Note that the first 2 could either be idx 0 or 2, it doesn't really matter.
Thanks
You can groupby consecutive equal numbers and transform:
import itertools

def remove_duplicates(s):
    ''' Generates booleans that indicate when a pair of ints with
        opposite signs is found.
    '''
    iter_ = iter(s)
    for (a, b) in itertools.zip_longest(iter_, iter_):
        if b is None:
            yield False
        else:
            yield a + b == 0
            yield a + b == 0
>>> mask = df.groupby(df['incremental_changes'].abs().diff().ne(0).cumsum()) \
['incremental_changes'] \
.transform(remove_duplicates)
Then
>>> df[~mask]
idx incremental_changes
2 2 2
3 3 1
4 4 4
7 7 7
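To see what the groupby key does, print it for this data; it labels runs of consecutive numbers that share an absolute value (an illustrative check):

key = df['incremental_changes'].abs().diff().ne(0).cumsum()
print(key.tolist())
# [1, 1, 1, 2, 3, 4, 4, 5, 6, 6]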
Just do rolling, then filter out the matched pairs:
# pairwise sums: a 0 marks an adjacent pair that cancels
s = df.incremental_changes.rolling(2).sum()
# consecutive 0s come from overlapping pairs, so ignore the second 0 in each run
s = s.mask(s[s == 0].groupby(s.ne(0).cumsum()).cumcount() == 1) == 0
# drop both members of each cancelling pair
df[~(s | s.shift(-1))]
Out[640]:
idx incremental_changes
2 2 2
3 3 1
4 4 4
7 7 7
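To see why this works, inspect the intermediate rolling sums on the same data; each 0 marks an adjacent pair that sums to zero:

s = df.incremental_changes.rolling(2).sum()
print(s.tolist())
# [nan, 0.0, 0.0, 3.0, 5.0, 9.0, 0.0, 2.0, 1.0, 0.0]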
How can I rotate a list clockwise by one position? I have a temporary solution, but I'm sure there is a better way to do it.
I want to get from this
Index: 0 1 2 3 4 5 6 7 8 9
Count: 0 2 4 4 5 6 6 7 7 7
to this:
Index: 0 1 2 3 4 5 6 7 8 9
Count: 0 0 2 4 4 5 6 6 7 7
And my temporary "solution" is just:
temporary = [0, 2, 4, 4, 5, 6, 6, 7, 7, 7]
test = [None] * len(temporary)
test[0] = temporary[0]
for index in range(1, len(temporary)):
    test[index] = temporary[index - 1]
You might use temporary.pop() to discard the last item and temporary.insert(0, 0) to add 0 to the front.
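For example (this mutates the list in place):

temporary = [0, 2, 4, 4, 5, 6, 6, 7, 7, 7]
temporary.pop()           # discard the last item
temporary.insert(0, 0)    # add 0 to the front
print(temporary)          # [0, 0, 2, 4, 4, 5, 6, 6, 7, 7]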
Alternatively in one line:
temporary = [0] + temporary[:-1]
I am trying to add a new column (True/False) to a pandas DataFrame, which reflects whether the value lies between two datapoints from another file.
I have a two files which give the following info:
File A (x):           File B (y):
   time                  time_A  time_B
0     1               0       1       3
1     3               1       5       6
2     5               2       8      10
3     7
4     9
5    11
6    13
I tried to do it with the .map function; however, it gives True and False for each event, not one column.
x['Event'] = x['time'].map((lambda x: x < y['time_A']), (lambda x: x > y['time_B']))
This would be the expected result
->
File A:
'time' 'Event'
0 1 True
1 3 True
2 5 True
3 7 False
4 9 True
5 11 False
6 13 False
However, what I get is something like this:
->
File A:
'time'
0 1 "0 True
1 True
2 True"
Name:1, dtype:bool"
2 3 "0 True
1 True
2 True
Name:1, dtype:bool"
This should do it:
(x.assign(key=1)
 .merge(y.assign(key=1),
        on='key')                # cross join: every time paired with every interval
 .drop(columns='key')
 .assign(Event=lambda v: (v['time_A'] <= v['time']) &
                         (v['time'] <= v['time_B']))
 .groupby('time', as_index=False)['Event']
 .any())                         # True if the time falls inside any interval
time Event
0 1 True
1 3 True
2 5 True
3 7 False
4 9 True
5 11 False
6 13 False
Use pd.IntervalIndex here:
idx = pd.IntervalIndex.from_arrays(B['time_A'], B['time_B'], closed='both')
# output -> IntervalIndex([[1, 3], [5, 6], [8, 10]], closed='both', dtype='interval[int64]')
A['Event'] = B.set_index(idx).reindex(A['time']).notna().all(axis=1).to_numpy()
print(A)
time Event
0 1 True
1 3 True
2 5 True
3 7 False
4 9 True
5 11 False
6 13 False
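The reindex is what does the matching: B, indexed by intervals, is reindexed on A['time'], so each time that falls inside an interval picks up that interval's row, and times outside every interval come back as all-NaN. The intermediate looks like this (illustrative printout for the example data):

print(B.set_index(idx).reindex(A['time']))
#       time_A  time_B
# time
# 1        1.0     3.0
# 3        1.0     3.0
# 5        5.0     6.0
# 7        NaN     NaN
# 9        8.0    10.0
# 11       NaN     NaN
# 13       NaN     NaN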
One liner:
A['Event'] = sum(A.time.between(b.time_A, b.time_B) for _, b in B.iterrows()) > 0
Explanation:
For each row b of the B dataframe, A.time.between(b.time_A, b.time_B) returns a boolean Series indicating whether each time lies between time_A and time_B.
sum(list_of_boolean_series) > 0 then acts as an elementwise OR.
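A tiny check of that last point (toy Series, not from the question):

import pandas as pd

a = pd.Series([True, False, False])
b = pd.Series([False, False, True])
print((sum([a, b]) > 0).tolist())   # [True, False, True]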
I want to assign values based on a condition on index in Pandas DataFrame.
class test():
    def __init__(self):
        self.l = 1396633637830123000
        self.dfa = pd.DataFrame(np.arange(20).reshape(10, 2), columns=['A', 'B'],
                                index=np.arange(self.l, self.l + 10))
        self.dfb = pd.DataFrame([[self.l + 1, self.l + 3], [self.l + 6, self.l + 9]],
                                columns=['beg', 'end'])
    def update(self):
        self.dfa['true'] = False
        self.dfa['idx'] = np.nan
        for i, beg, end in zip(self.dfb.index, self.dfb['beg'], self.dfb['end']):
            self.dfa.ix[beg:end]['true'] = True
            self.dfa.ix[beg:end]['idx'] = i
    def do(self):
        self.update()
        print(self.dfa)
t = test()
t.do()
Result:
A B true idx
1396633637830123000 0 1 False NaN
1396633637830123001 2 3 True NaN
1396633637830123002 4 5 True NaN
1396633637830123003 6 7 True NaN
1396633637830123004 8 9 False NaN
1396633637830123005 10 11 False NaN
1396633637830123006 12 13 True NaN
1396633637830123007 14 15 True NaN
1396633637830123008 16 17 True NaN
1396633637830123009 18 19 True NaN
The true column is correctly assigned, while the idx column is not. Furthermore, this seems to depend on how the columns are initialized, because if I do:
def update(self):
    self.dfa['true'] = False
    self.dfa['idx'] = False
then the true column also does not get properly assigned.
What am I doing wrong?
p.s. the expected result is:
A B true idx
1396633637830123000 0 1 False NaN
1396633637830123001 2 3 True 0
1396633637830123002 4 5 True 0
1396633637830123003 6 7 True 0
1396633637830123004 8 9 False NaN
1396633637830123005 10 11 False NaN
1396633637830123006 12 13 True 1
1396633637830123007 14 15 True 1
1396633637830123008 16 17 True 1
1396633637830123009 18 19 True 1
Edit: I tried assigning using both loc and iloc but it doesn't seem to work:
loc:
self.dfa.loc[beg:end]['true'] = True
self.dfa.loc[beg:end]['idx'] = i
iloc:
self.dfa.loc[self.dfa.index.get_loc(beg):self.dfa.index.get_loc(end)]['true'] = True
self.dfa.loc[self.dfa.index.get_loc(beg):self.dfa.index.get_loc(end)]['idx'] = i
You are chain indexing, see here. The warning is not guaranteed to happen.
You should probably just do this. There is no real need to actually track the index in b, by the way.
In [44]: dfa = pd.DataFrame(np.arange(20).reshape(10,2), columns = ['A', 'B'], index = np.arange(l,l+10))
In [45]: dfb = pd.DataFrame([[l+1,l+3], [l+6,l+9]], columns = ['beg', 'end'])
In [46]: dfa['in_b'] = False
In [47]: for i, s in dfb.iterrows():
....: dfa.loc[s['beg']:s['end'],'in_b'] = True
....:
or this if you have non-integer dtypes
In [36]: for i, s in dfb.iterrows():
dfa.loc[(dfa.index>=s['beg']) & (dfa.index<=s['end']),'in_b'] = True
In [48]: dfa
Out[48]:
A B in_b
1396633637830123000 0 1 False
1396633637830123001 2 3 True
1396633637830123002 4 5 True
1396633637830123003 6 7 True
1396633637830123004 8 9 False
1396633637830123005 10 11 False
1396633637830123006 12 13 True
1396633637830123007 14 15 True
1396633637830123008 16 17 True
1396633637830123009 18 19 True
[10 rows x 3 columns]
If b is HUGE this might not be THAT performant.
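If performance does become an issue, one vectorized alternative (my addition, not part of the original answer; it assumes dfb's intervals are sorted by beg and non-overlapping) is np.searchsorted:

import numpy as np

begs = dfb['beg'].to_numpy()
ends = dfb['end'].to_numpy()
vals = dfa.index.to_numpy()
pos = np.searchsorted(begs, vals, side='right') - 1   # last interval whose beg <= value
dfa['in_b'] = (pos >= 0) & (vals <= ends[pos.clip(0)])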
As an aside, these look like nanosecond timestamps. They can be more friendly if you convert them:
In [49]: pd.to_datetime(dfa.index)
Out[49]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-04-04 17:47:17.830123, ..., 2014-04-04 17:47:17.830123009]
Length: 10, Freq: None, Timezone: None
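For example, converting the index in place (the integers are interpreted as nanoseconds since the epoch):

dfa.index = pd.to_datetime(dfa.index)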