pandas dataframe count row values - python

I have a word list like the following.
wordlist = ['p1','p2','p3','p4','p5','p6','p7']
And the dataframe is like the following.
df = pd.DataFrame({'id' : [1,2,3,4],
                   'path' : ["p1,p2,p3,p4","p1,p2,p1","p1,p5,p5,p7","p1,p2,p3,p3"]})
output:
id path
1 p1,p2,p3,p4
2 p1,p2,p1
3 p1,p5,p5,p7
4 p1,p2,p3,p3
I want to count the path data to get the following output. Is it possible to get this kind of transformation?
id p1 p2 p3 p4 p5 p6 p7
1 1 1 1 1 0 0 0
2 2 1 0 0 0 0 0
3 1 0 0 0 2 0 1
4 1 1 2 0 0 0 0

I think this would be efficient:
# create Series with dictionaries
>>> from collections import Counter
>>> c = df["path"].str.split(',').apply(Counter)
>>> c
0 {u'p2': 1, u'p3': 1, u'p1': 1, u'p4': 1}
1 {u'p2': 1, u'p1': 2}
2 {u'p1': 1, u'p7': 1, u'p5': 2}
3 {u'p2': 1, u'p3': 2, u'p1': 1}
# create DataFrame
>>> pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})
p1 p2 p3 p4 p5 p6 p7
0 1 1 1 1 0 0 0
1 2 1 0 0 0 0 0
2 1 0 0 0 2 0 1
3 1 1 2 0 0 0 0
update
Another way to do this:
>>> dfN = df["path"].str.split(',').apply(lambda x: pd.Series(Counter(x)))
>>> pd.DataFrame(dfN, columns=wordlist).fillna(0)
p1 p2 p3 p4 p5 p6 p7
0 1 1 1 1 0 0 0
1 2 1 0 0 0 0 0
2 1 0 0 0 2 0 1
3 1 1 2 0 0 0 0
update 2
Some rough tests for performance:
>>> from timeit import timeit
>>> dfL = pd.concat([df]*100)
>>> timeit('c = dfL["path"].str.split(",").apply(Counter); d = pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd; from collections import Counter', number=100)
0.7363274283027295
>>> timeit('splitted = dfL["path"].str.split(","); d = pd.DataFrame({name : splitted.apply(lambda x: x.count(name)) for name in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd', number=100)
0.5305424618886718
# now let's make wordlist larger
>>> from string import lowercase, uppercase  # Python 2; use ascii_lowercase/ascii_uppercase on Python 3
>>> wordlist = wordlist + list(lowercase) + list(uppercase)
>>> timeit('c = dfL["path"].str.split(",").apply(Counter); d = pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd; from collections import Counter', number=100)
1.765344003293876
>>> timeit('splitted = dfL["path"].str.split(","); d = pd.DataFrame({name : splitted.apply(lambda x: x.count(name)) for name in wordlist})', 'from __main__ import dfL, wordlist; import pandas as pd', number=100)
2.33328927599905
update 3
After reading this topic I've found that Counter is really slow. You can optimize it a bit by using defaultdict:
>>> from collections import defaultdict
>>> def create_dict(x):
...     d = defaultdict(int)
...     for c in x:
...         d[c] += 1
...     return d
>>> c = df["path"].str.split(",").apply(create_dict)
>>> pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})
p1 p2 p3 p4 p5 p6 p7
0 1 1 1 1 0 0 0
1 2 1 0 0 0 0 0
2 1 0 0 0 2 0 1
3 1 1 2 0 0 0 0
and some tests:
>>> timeit('c = dfL["path"].str.split(",").apply(create_dict); d = pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})', 'from __main__ import dfL, wordlist, create_dict; import pandas as pd; from collections import defaultdict', number=100)
0.45942801555111146
# now let's make wordlist larger
>>> wordlist = wordlist + list(lowercase) + list(uppercase)
>>> timeit('c = dfL["path"].str.split(",").apply(create_dict); d = pd.DataFrame({n: c.apply(lambda x: x[n]) for n in wordlist})', 'from __main__ import dfL, wordlist, create_dict; import pandas as pd; from collections import defaultdict', number=100)
1.5798653213942089

You can use the vectorized string method str.count() (see docs and reference) and, for each element in wordlist, feed that into a new dataframe:
In [4]: pd.DataFrame({name : df["path"].str.count(name) for name in wordlist})
Out[4]:
p1 p2 p3 p4 p5 p6 p7
id
1 1 1 1 1 0 0 0
2 2 1 0 0 0 0 0
3 1 0 0 0 2 0 1
4 1 1 2 0 0 0 0
UPDATE: some answers to the comments. Indeed, this will not work if the strings can be substrings of each other (the OP should clarify whether that can happen). If that is the case, this would work (and is also faster):
splitted = df["path"].str.split(",")
pd.DataFrame({name : splitted.apply(lambda x: x.count(name)) for name in wordlist})
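To see how plain str.count can overcount when one name is a substring of another, a tiny illustration (the values here are made up for the example):
>>> pd.Series(["p1,p11"]).str.count("p1")
0    2
dtype: int64
>>> pd.Series(["p1,p11"]).str.split(",").apply(lambda x: x.count("p1"))
0    1
dtype: int64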
And some tests to back up my claim of being faster :-)
Of course, I don't know what the realistic use case is, but I made the dataframe a bit larger (just repeated it 1000 times), where the differences are more pronounced:
In [37]: %%timeit
....: splitted = df["path"].str.split(",")
....: pd.DataFrame({name : splitted.apply(lambda x: x.count(name)) for name in wordlist})
....:
100 loops, best of 3: 17.9 ms per loop
In [38]: %%timeit
....: pd.DataFrame({name:df["path"].str.count(name) for name in wordlist})
....:
10 loops, best of 3: 23.6 ms per loop
In [39]: %%timeit
....: c = df["path"].str.split(',').apply(Counter)
....: pd.DataFrame({n: c.apply(lambda x: x.get(n, 0)) for n in wordlist})
....:
10 loops, best of 3: 42.3 ms per loop
In [40]: %%timeit
....: dfN = df["path"].str.split(',').apply(lambda x: pd.Series(Counter(x)))
....: pd.DataFrame(dfN, columns=wordlist).fillna(0)
....:
1 loops, best of 3: 715 ms per loop
I also did the test with more elements in wordlist, and the conclusion is: if you have a larger dataframe with a relatively small wordlist, my approach is faster; if you have a large wordlist, the approach with Counter from @RomanPekar can be faster (but only the last variant).
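If you also need the id column from the question as the index of the result (as in the desired output), one way is to concatenate it back on; a sketch reusing splitted from above:
splitted = df["path"].str.split(",")
counts = pd.DataFrame({name: splitted.apply(lambda x: x.count(name)) for name in wordlist})
print(pd.concat([df["id"], counts], axis=1).set_index("id"))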

Something similar to this:
df1 = pd.DataFrame([[path.count(p) for p in wordlist] for path in df['path']],columns=['p1','p2','p3','p4','p5','p6','p7'])

Related

Optimizing nested for-loops

I have a pandas dataframe containing a range of columns A, B, C, D (either 0 or 1) and a range of columns AB, AC, BC, CD that contain their interaction (also either 0 or 1).
Based on the interactions, I want to establish the existence of "triplets" ABC, ABD, ACD, BCD as in the following MWE:
import numpy as np
import pandas as pd
df = pd.DataFrame()
np.random.seed(1)
df["A"] = np.random.randint(2, size=10)
df["B"] = np.random.randint(2, size=10)
df["C"] = np.random.randint(2, size=10)
df["D"] = np.random.randint(2, size=10)
df["AB"] = np.random.randint(2, size=10)
df["AC"] = np.random.randint(2, size=10)
df["AD"] = np.random.randint(2, size=10)
df["BC"] = np.random.randint(2, size=10)
df["BD"] = np.random.randint(2, size=10)
df["CD"] = np.random.randint(2, size=10)
ls = ["A", "B", "C", "D"]
for i, a in enumerate(ls):
    for j in range(i + 1, len(ls)):
        b = ls[j]
        for k in range(j + 1, len(ls)):
            c = ls[k]
            idx = a+b+c
            idx_abc = (df[a]>0) & (df[b]>0) & (df[c]>0)
            sum_abc = df[idx_abc][a+b] + df[idx_abc][b+c] + df[idx_abc][a+c]
            df[a+b+c] = 0
            df.loc[sum_abc.index[sum_abc>=2], a+b+c] = 999
This gives the following output:
A B C D AB AC AD BC BD CD ABC ABD ACD BCD
0 1 0 0 0 1 0 0 1 1 0 0 0 0 0
1 1 1 1 0 1 1 1 1 0 0 999 0 0 0
2 0 0 0 1 1 0 1 0 0 1 0 0 0 0
3 0 1 0 1 1 0 0 0 1 1 0 0 0 0
4 1 1 1 1 1 1 1 0 1 1 999 999 999 999
5 1 0 0 1 1 1 1 0 0 0 0 0 0 0
6 1 0 0 1 0 1 1 1 1 1 0 0 0 0
7 1 1 0 0 1 0 1 1 1 1 0 0 0 0
8 1 0 1 0 1 1 0 1 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 1 1 0 0 0 0
The logic behind the code is the following: A triplet ABC is active (=1) if at least two of the columns AB, AC, BC are active (=1) and the individual columns A, B, C are all active (=1).
I always start by looking at the individual columns (in the case of ABC then this is A, B and C). Looking at columns A, B and C, we only "keep" the rows where A, B and C are all non-zero. Then, looking at the interactions AB, AC and BC, we only "enable" the triplet ABC if at least two out of AB, AC and BC are 1 - which they are only for rows 1 and 4! Hence ABC = 999 for rows 1 and 4 and 0 for all others. This I do for all possible triplets (4 in this case).
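For a single triplet, the rule can be written directly as one vectorized condition; a minimal sketch for ABC (equivalent to what the loop above does for that column):
individual = (df["A"] > 0) & (df["B"] > 0) & (df["C"] > 0)   # A, B and C all active
pairs = (df["AB"] + df["AC"] + df["BC"]) >= 2                # at least two of AB, AC, BC active
df["ABC"] = np.where(individual & pairs, 999, 0)             # same 999 marker as above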
The loop above runs fast because the example dataframe is small. However, in my real code the dataframe has more than one million rows and hundreds of interactions, and then it runs extremely slowly.
Is there a way to optimize the above code, e.g. by multithreading it?
Here is a method that is about 10x faster than your reference code. It does nothing particularly clever, just pedestrian optimization.
import numpy as np
import pandas as pd
df = pd.DataFrame()
np.random.seed(1)
df["A"] = np.random.randint(2, size=10)
df["B"] = np.random.randint(2, size=10)
df["C"] = np.random.randint(2, size=10)
df["D"] = np.random.randint(2, size=10)
df["AB"] = np.random.randint(2, size=10)
df["AC"] = np.random.randint(2, size=10)
df["AD"] = np.random.randint(2, size=10)
df["BC"] = np.random.randint(2, size=10)
df["BD"] = np.random.randint(2, size=10)
df["CD"] = np.random.randint(2, size=10)
ls = ["A", "B", "C", "D"]
def op():
    out = df.copy()
    for i, a in enumerate(ls):
        for j in range(i + 1, len(ls)):
            b = ls[j]
            for k in range(j + 1, len(ls)):
                c = ls[k]
                idx = a+b+c
                idx_abc = (out[a]>0) & (out[b]>0) & (out[c]>0)
                sum_abc = out[idx_abc][a+b] + out[idx_abc][b+c] + out[idx_abc][a+c]
                out[a+b+c] = 0
                out.loc[sum_abc.index[sum_abc>=2], a+b+c] = 99
    return out
import scipy.spatial.distance as ssd
def pp():
    data = df.values
    n = len(ls)
    # first n columns are the single letters, the rest are the pair columns
    d1, d2 = np.split(data, [n], axis=1)
    i, j = np.triu_indices(n, 1)
    # a pair only counts if both of its single columns are active
    d2 = d2 & d1[:,i] & d1[:,j]
    # indices of all triplets (k, i, j) with k < i < j
    k, i, j = np.ogrid[:n,:n,:n]
    k, i, j = np.where((k<i)&(i<j))
    # lookup table from a (row, col) pair to its position among the pair columns
    lu = ssd.squareform(np.arange(n*(n-1)//2))
    d3 = ((d2[:,lu[k,i]]+d2[:,lu[i,j]]+d2[:,lu[k,j]])>=2).view(np.uint8)*99
    *triplets, = map("".join, combinations(ls,3))
    out = df.copy()
    out[triplets] = pd.DataFrame(d3, columns=triplets)
    return out
from string import ascii_uppercase
from itertools import combinations, chain
def make(nl=8, nr=1000000, seed=1):
    np.random.seed(seed)
    letters = np.fromiter(ascii_uppercase, 'U1', nl)
    df = pd.DataFrame()
    for l in chain(letters, map("".join, combinations(letters, 2))):
        df[l] = np.random.randint(0, 2, nr, dtype=np.uint8)
    return letters, df
df1 = op()
df2 = pp()
assert (df1==df2).all().all()
ls, df = make(8,1000)
df1 = op()
df2 = pp()
assert (df1==df2).all().all()
from timeit import timeit
print(timeit(op,number=10))
print(timeit(pp,number=10))
ls, df = make(26,250000)
import time
t0 = time.perf_counter()
df2 = pp()
t1 = time.perf_counter()
print(t1-t0)
Sample run:
3.2022583668585867 # op 8 symbols, 1000 rows, 10 repeats
0.2772211490664631 # pp 8 symbols, 1000 rows, 10 repeats
12.412292044842616 # pp 26 symbols, 250,000 rows, single run

Pandas: Find row wise frequent value

I have a dataset with binary values. I want to find out the frequent value in each row. This dataset has a couple of million records. What would be the most efficient way to do it? Following is a sample of the dataset.
import pandas as pd
data = pd.read_csv('myData.csv', sep = ',')
data.head()
bit1 bit2 bit2 bit4 bit5 frequent freq_count
0 0 0 1 1 0 3
1 1 1 0 0 1 3
1 0 1 1 1 1 4
I want to create the frequent as well as freq_count columns like in the sample above. These are not part of the original dataset and will be created after looking at all rows.
Here's one approach -
import numpy as np
def freq_stat(df):
    a = df.values
    zero_c = (a==0).sum(1)
    one_c = a.shape[1] - zero_c
    df['frequent'] = (zero_c<=one_c).astype(int)
    df['freq_count'] = np.maximum(zero_c, one_c)
    return df
Sample run -
In [305]: df
Out[305]:
bit1 bit2 bit2.1 bit4 bit5
0 0 0 0 1 1
1 1 1 1 0 0
2 1 0 1 1 1
In [308]: freq_stat(df)
Out[308]:
bit1 bit2 bit2.1 bit4 bit5 frequent freq_count
0 0 0 0 1 1 0 3
1 1 1 1 0 0 1 3
2 1 0 1 1 1 1 4
Benchmarking
Let's test out this one against the fastest approach from @jezrael's solution:
from scipy import stats
def mod(df): # @jezrael's best solution
    a = df.values.T
    b = stats.mode(a)
    df['a'] = b[0][0]
    df['b'] = b[1][0]
    return df
Also, let's use the same setup from the other post and get the timings -
In [323]: np.random.seed(100)
...: N = 10000
...: #[10000 rows x 20 columns]
...: df = pd.DataFrame(np.random.randint(2, size=(N,20)))
...:
# @jezrael's solution
In [324]: %timeit mod(df)
100 loops, best of 3: 5.92 ms per loop
# Proposed in this post
In [325]: %timeit freq_stat(df)
1000 loops, best of 3: 496 µs per loop
You can use scipy.stats.mode:
from scipy import stats
a = df.values.T
b = stats.mode(a)
print(b)
ModeResult(mode=array([[0, 1, 1]], dtype=int64), count=array([[3, 3, 4]]))
df['frequent'] = b[0][0]
df['freq_count'] = b[1][0]
print (df)
bit1 bit2 bit2.1 bit4 bit5 frequent freq_count
0 0 0 0 1 1 0 3
1 1 1 1 0 0 1 3
2 1 0 1 1 1 1 4
Use Counter.most_common:
from collections import Counter
def f(x):
    a, b = Counter(x).most_common(1)[0]
    return pd.Series([a, b])
df[['frequent','freq_count']] = df.apply(f, axis=1)
Another solution:
def f(x):
    counts = np.bincount(x)
    a = np.argmax(counts)
    b = np.max(counts)
    return pd.Series([a, b])
df[['frequent','freq_count']] = df.apply(f, axis=1)
Alternative:
from collections import defaultdict
def f(x):
    d = defaultdict(int)
    for i in x:
        d[i] += 1
    return pd.Series(sorted(d.items(), key=lambda x: x[1], reverse=True)[0])
df[['frequent','freq_count']] = df.apply(f, axis=1)
Timings:
np.random.seed(100)
N = 10000
#[10000 rows x 20 columns]
df = pd.DataFrame(np.random.randint(2, size=(N,20)))
In [140]: %timeit df.apply(f1, axis=1)
1 loop, best of 3: 1.78 s per loop
In [141]: %timeit df.apply(f2, axis=1)
1 loop, best of 3: 1.66 s per loop
In [142]: %timeit df.apply(f3, axis=1)
1 loop, best of 3: 1.7 s per loop
In [143]: %timeit mod(df)
100 loops, best of 3: 8.37 ms per loop
In [144]: %timeit mod1(df)
100 loops, best of 3: 8.88 ms per loop
from collections import Counter
from collections import defaultdict
from scipy import stats
def f1(x):
    a, b = Counter(x).most_common(1)[0]
    return pd.Series([a, b])
def f2(x):
    counts = np.bincount(x)
    a = np.argmax(counts)
    b = np.max(counts)
    return pd.Series([a, b])
def f3(x):
    d = defaultdict(int)
    for i in x:
        d[i] += 1
    return pd.Series(sorted(d.items(), key=lambda x: x[1], reverse=True)[0])
def mod(df):
    a = df.values.T
    b = stats.mode(a)
    df['a'] = b[0][0]
    df['b'] = b[1][0]
    return df
def mod1(df):
    a = df.values
    b = stats.mode(a, axis=1)
    df['a'] = b[0][:, 0]
    df['b'] = b[1][:, 0]
    return df

Pandas: Assign values of column up to a limit set by dictionary values

How can I remove the iterrows()? Can this be done faster with numpy or pandas?
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': 'foo bar foo bar foo bar foo foo'.split(),
                   'B': 'one one two three two two one three'.split(),
                   'C': np.arange(8)*0 })
print(df)
# A B C
# 0 foo one 0
# 1 bar one 0
# 2 foo two 0
# 3 bar three 0
# 4 foo two 0
# 5 bar two 0
# 6 foo one 0
# 7 foo three 0
selDict = {"foo":2, "bar":3}
This works:
for i, r in df.iterrows():
    if selDict[r["A"]] > 0:
        selDict[r["A"]] -= 1
        df.set_value(i, 'C', 1)
print(df)
# A B C
# 0 foo one 1
# 1 bar one 1
# 2 foo two 1
# 3 bar three 1
# 4 foo two 0
# 5 bar two 1
# 6 foo one 0
# 7 foo three 0
If I understood correctly, you can use cumcount:
df['C'] = (df.groupby('A').cumcount() < df['A'].map(selDict)).astype('int')
df
Out:
A B C
0 foo one 1
1 bar one 1
2 foo two 1
3 bar three 1
4 foo two 0
5 bar two 1
6 foo one 0
7 foo three 0
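To see why this works: cumcount numbers the rows within each 'A' group starting from 0, and map looks up each key's limit, so a row gets a 1 only while its group's running count is still below that limit. A quick check of the two intermediate pieces for the sample df:
>>> df.groupby('A').cumcount().tolist()
[0, 0, 1, 1, 2, 2, 3, 4]
>>> df['A'].map(selDict).tolist()
[2, 3, 2, 3, 2, 3, 2, 2]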
Here's one approach -
1) Helper functions :
def argsort_unique(idx):
    # Original idea : http://stackoverflow.com/a/41242285/3293881 by @Andras
    n = idx.size
    sidx = np.empty(n, dtype=int)
    sidx[idx] = np.arange(n)
    return sidx
def get_bin_arr(grplens, stop1_idx):
    count_stops_corr = np.minimum(stop1_idx, grplens)
    limsc = np.maximum(grplens, count_stops_corr)
    L = limsc.sum()
    starts = np.r_[0,limsc[:-1].cumsum()]
    shift_arr = np.zeros(L, dtype=int)
    stops = starts + count_stops_corr
    stops = stops[stops<L]
    shift_arr[starts] += 1
    shift_arr[stops] -= 1
    bin_arr = shift_arr.cumsum()
    return bin_arr
Possibly faster alternative with a loopy, slicing-based helper function:
def get_bin_arr(grplens, stop1_idx):
    stop1_idx_corr = np.minimum(stop1_idx, grplens)
    clens = grplens.cumsum()
    out = np.zeros(clens[-1], dtype=int)
    out[:stop1_idx_corr[0]] = 1
    for i, j in zip(clens[:-1], clens[:-1] + stop1_idx_corr[1:]):
        out[i:j] = 1
    return out
2) Main function :
def out_C(A, selDict):
    k = np.array(list(selDict.keys()))    # list() so this also works on Python 3
    v = np.array(list(selDict.values()))
    unq, C = np.unique(A, return_counts=1)
    sidx3 = np.searchsorted(unq, k)
    lims = np.zeros(len(unq), dtype=int)
    lims[sidx3] = v
    bin_arr = get_bin_arr(C, lims)
    sidx2 = A.argsort()
    out = bin_arr[argsort_unique(sidx2)]
    return out
Sample runs -
Original approach :
def org_app(df, selDict):
    df['C'] = 0
    d = selDict.copy()
    for i, r in df.iterrows():
        if d[r["A"]] > 0:
            d[r["A"]] -= 1
            df.set_value(i, 'C', 1)
    return df
Case #1 :
>>> df = pd.DataFrame({'A': 'foo bar foo bar res foo bar res foo foo res'.split()})
>>> selDict = {"foo":2, "bar":3, "res":1}
>>> org_app(df, selDict)
A C
0 foo 1
1 bar 1
2 foo 1
3 bar 1
4 res 1
5 foo 0
6 bar 1
7 res 0
8 foo 0
9 foo 0
10 res 0
>>> out_C(df.A.values, selDict)
array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0])
Case #2 :
>>> selDict = {"foo":20, "bar":30, "res":10}
>>> org_app(df, selDict)
A C
0 foo 1
1 bar 1
2 foo 1
3 bar 1
4 res 1
5 foo 1
6 bar 1
7 res 1
8 foo 1
9 foo 1
10 res 1
>>> out_C(df.A.values, selDict)
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
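To attach the result to the dataframe, assign it back as the C column (a usage sketch):
>>> df['C'] = out_C(df.A.values, selDict)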
scipy.stats.rankdata can help here. In order to derive the rank of each element within its bucket, we take the difference between the "min" and "ordinal" methods:
>>> from scipy.stats import rankdata as rd
>>> rd(df.A, 'ordinal') - rd(df.A, 'min')
array([0, 0, 1, 1, 2, 2, 3, 4])
Then we just compare to df.A.map(selDict):
df.C = (rd(df.A, 'ordinal') - rd(df.A, 'min') < df.A.map(selDict)).astype(int)
This may be a little inefficient (calling rankdata twice), but using the optimized routines in scipy should make up for that.
If you can't use scipy you can use repeated argsort() for the "ordinal" method and my solution using unique and bincount for the "min" method:
>>> _, v = np.unique(df.A, return_inverse=True)
>>> df.A.argsort().argsort() - (np.cumsum(np.concatenate(([0], np.bincount(v)))))[v]
0 0
1 0
2 1
3 1
4 2
5 2
6 3
7 4
Name: A, dtype: int64
Then compare to df.A.map(selDict) as above.
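Putting the no-scipy pieces together (a sketch; pos is just an illustrative name):
_, v = np.unique(df.A, return_inverse=True)
pos = df.A.argsort().argsort() - np.cumsum(np.concatenate(([0], np.bincount(v))))[v]
df['C'] = (pos < df.A.map(selDict)).astype(int)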

Subtract aggregate from Pandas Series/Dataframe [duplicate]

This question already has answers here:
Calculate new value based on decreasing value
(4 answers)
Closed 5 years ago.
Given the following table
vals
0 20
1 3
2 2
3 10
4 20
I'm trying to find a clean solution in pandas to subtract a value, say 30 for example, and end up with the following result.
vals
0 0
1 0
2 0
3 5
4 20
I was wondering if pandas had a solution to performing this that didn't require looping through all the rows in a dataframe, something that takes advantage of pandas's bulk operations.
identify where cumsum is greater than or equal to 30
mask the rows where it isn't
reassign the one row to be the cumsum less 30
c = df.vals.cumsum()       # running total: 20, 23, 25, 35, 55
m = c.ge(30)               # True from the first row where the total reaches 30
i = m.idxmax()             # index of that row (3 here)
n = df.vals.where(m, 0)    # zero out everything before it
n.loc[i] = c.loc[i] - 30   # that row keeps only the excess: 35 - 30 = 5
df.assign(vals=n)
vals
0 0
1 0
2 0
3 5
4 20
Same thing, but numpyfied
v = df.vals.values
c = v.cumsum()
m = c >= 30
i = m.argmax()
n = np.where(m, v, 0)
n[i] = c[i] - 30
df.assign(vals=n)
vals
0 0
1 0
2 0
3 5
4 20
Timing
%%timeit
v = df.vals.values
c = v.cumsum()
m = c >= 30
i = m.argmax()
n = np.where(m, v, 0)
n[i] = c[i] - 30
df.assign(vals=n)
10000 loops, best of 3: 168 µs per loop
%%timeit
c = df.vals.cumsum()
m = c.ge(30)
i = m.idxmax()
n = df.vals.where(m, 0)
n.loc[i] = c.loc[i] - 30
df.assign(vals=n)
1000 loops, best of 3: 853 µs per loop
Here's one using NumPy with four lines of code (note that it updates df in place, since v is a view of the column's values) -
v = df.vals.values
a = v.cumsum()-30              # running total minus the amount to subtract
idx = (a>0).argmax()+1         # one past the first row where the total exceeds 30
v[:idx] = a.clip(min=0)[:idx]  # those rows become 0, except the last which keeps the remainder
Sample run -
In [274]: df # Original df
Out[274]:
vals
0 20
1 3
2 2
3 10
4 20
In [275]: df.iloc[3,0] = 7 # Bringing in some variety
In [276]: df
Out[276]:
vals
0 20
1 3
2 2
3 7
4 20
In [277]: v = df.vals.values
...: a = v.cumsum()-30
...: idx = (a>0).argmax()+1
...: v[:idx] = a.clip(min=0)[:idx]
...:
In [278]: df
Out[278]:
vals
0 0
1 0
2 0
3 2
4 20
# A one-liner solution
df['vals'] = df.assign(res = 30-df.vals.cumsum()).apply(lambda x: 0 if x.res>0 else x.vals if abs(x.res)>x.vals else x.vals-abs(x.res), axis=1)
df
Out[96]:
vals
0 0
1 0
2 0
3 5
4 20

Identifying consecutive occurrences of a value in a column of a pandas DataFrame

I have a df like so:
Count
1
0
1
1
0
0
1
1
1
0
and I want to return a 1 in a new column if there are two or more consecutive occurrences of 1 in Count, and a 0 if there are not. So each row in the new column would get a 1 based on this criterion being met in the column Count. My desired output would then be:
Count New_Value
1 0
0 0
1 1
1 1
0 0
0 0
1 1
1 1
1 1
0 0
I am thinking I may need to use itertools, but I have been reading about it and haven't come across what I need yet. I would also like to be able to use this method to count any number of consecutive occurrences, not just 2. For example, sometimes I need to count 10 consecutive occurrences; I just use 2 in the example here.
You could:
df['consecutive'] = df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count
to get:
Count consecutive
0 1 1
1 0 0
2 1 2
3 1 2
4 0 0
5 0 0
6 1 3
7 1 3
8 1 3
9 0 0
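The grouper (df.Count != df.Count.shift()).cumsum() starts a new group id every time the value changes, so each consecutive run gets its own group; a quick check on the sample column:
>>> (df.Count != df.Count.shift()).cumsum().tolist()
[1, 2, 3, 3, 4, 4, 5, 5, 5, 6]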
From here you can, for any threshold:
threshold = 2
df['consecutive'] = (df.consecutive > threshold).astype(int)
to get:
Count consecutive
0 1 0
1 0 0
2 1 1
3 1 1
4 0 0
5 0 0
6 1 1
7 1 1
8 1 1
9 0 0
or, in a single step:
(df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count >= threshold).astype(int)
In terms of efficiency, using pandas methods provides a significant speedup when the size of the problem grows:
df = pd.concat([df for _ in range(1000)])
%timeit (df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count >= threshold).astype(int)
1000 loops, best of 3: 1.47 ms per loop
compared to:
%%timeit
l = []
for k, g in groupby(df.Count):
    size = sum(1 for _ in g)
    if k == 1 and size >= 2:
        l = l + [1]*size
    else:
        l = l + [0]*size
pd.Series(l)
10 loops, best of 3: 76.7 ms per loop
Not sure if this is optimized, but you can give it a try:
from itertools import groupby
import pandas as pd
l = []
for k, g in groupby(df.Count):
    size = sum(1 for _ in g)
    if k == 1 and size >= 2:
        l = l + [1]*size
    else:
        l = l + [0]*size
df['new_Value'] = pd.Series(l)
df
Count new_Value
0 1 0
1 0 0
2 1 1
3 1 1
4 0 0
5 0 0
6 1 1
7 1 1
8 1 1
9 0 0
