Best way to implement Apriori in Python pandas

What is the best way to implement the Apriori algorithm in pandas? So far I am stuck on extracting the patterns using for loops. Everything from the for loop onward does not work. Is there a vectorized way to do this in pandas?
import pandas as pd
import numpy as np
trans=pd.read_table('output.txt', header=None,index_col=0)
def apriori(trans, support=4):
    ts=pd.get_dummies(trans.unstack().dropna()).groupby(level=1).sum()
    #user input
    collen, rowlen =ts.shape
    #max length of items
    tssum=ts.sum(axis=1)
    maxlen=tssum.loc[tssum.idxmax()]
    items=list(ts.columns)
    results=[]
    #loop through items
    for c in range(1, maxlen):
        #generate patterns
        pattern=[]
        for n in len(pattern):
            #calculate support
            pattern=['supp']=pattern.sum/rowlen
        #filter by support level
        Condit=pattern['supp']> support
        pattern=pattern[Condit]
        results.append(pattern)
    return results
results =apriori(trans)
print results
When I run this with support 3 on the input below
a b c d e
0
11 1 1 1 0 0
666 1 0 0 1 1
10101 0 1 1 1 0
1010 1 1 1 1 0
414147 0 1 1 0 0
10101 1 1 0 1 0
1242 0 0 0 1 1
101 1 1 1 1 0
411 0 0 1 1 1
444 1 1 1 0 0
it should output something like
Pattern support
a 6
b 7
c 7
d 7
e 3
a,b 5
a,c 4
a,d 4

Assuming I understand what you're after, maybe
from itertools import combinations

def get_support(df):
    pp = []
    for cnum in range(1, len(df.columns)+1):
        for cols in combinations(df, cnum):
            s = df[list(cols)].all(axis=1).sum()
            pp.append([",".join(cols), s])
    sdf = pd.DataFrame(pp, columns=["Pattern", "Support"])
    return sdf
would get you started:
>>> s = get_support(df)
>>> s[s.Support >= 3]
Pattern Support
0 a 6
1 b 7
2 c 7
3 d 7
4 e 3
5 a,b 5
6 a,c 4
7 a,d 4
9 b,c 6
10 b,d 4
12 c,d 4
14 d,e 3
15 a,b,c 4
16 a,b,d 3
21 b,c,d 3
[15 rows x 2 columns]

Adding support, confidence, and lift calculation:
from itertools import combinations
import pandas as pd

def apriori(data, set_length=2):
    df_supports = []
    dataset_size = len(data)
    for combination_number in range(1, set_length+1):
        for cols in combinations(data.columns, combination_number):
            supports = data[list(cols)].all(axis=1).sum() * 1.0 / dataset_size
            confidenceAB = data[list(cols)].all(axis=1).sum() * 1.0 / len(data[data[cols[0]]==1])
            confidenceBA = data[list(cols)].all(axis=1).sum() * 1.0 / len(data[data[cols[-1]]==1])
            # lift is symmetric, so both directions reduce to the same value
            liftAB = confidenceAB * dataset_size / len(data[data[cols[-1]]==1])
            liftBA = confidenceBA * dataset_size / len(data[data[cols[0]]==1])
            df_supports.append([",".join(cols), supports, confidenceAB, confidenceBA, liftAB, liftBA])
    df_supports = pd.DataFrame(df_supports, columns=['Pattern', 'Support', 'ConfidenceAB', 'ConfidenceBA', 'liftAB', 'liftBA'])
    df_supports = df_supports.sort_values(by='Support', ascending=False)
    return df_supports
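A quick usage sketch, assuming the transactions have already been one-hot encoded into a 0/1 DataFrame (called ts here, as in the question):

# illustrative only: itemsets of up to two columns, kept when they appear in at least 3 of the rows
result = apriori(ts, set_length=2)
print(result[result.Support >= 3.0 / len(ts)])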

Related

Groupby on condition and calculate sum of subgroups

Here is my data:
import numpy as np
import pandas as pd
z = pd.DataFrame({'a':[1,1,1,2,2,3,3],'b':[3,4,5,6,7,8,9], 'c':[10,11,12,13,14,15,16]})
z
a b c
0 1 3 10
1 1 4 11
2 1 5 12
3 2 6 13
4 2 7 14
5 3 8 15
6 3 9 16
Question:
How can I do a calculation on different elements of each subgroup? For example, for each group, I want to extract every element in column 'c' whose corresponding element in column 'b' is between 4 and 9, and sum them all.
Here is the code I wrote (it runs, but I cannot get the correct result):
gbz = z.groupby('a')
# For displaying the groups:
gbz.apply(lambda x: print(x))

list = []
def f(x):
    list_new = []
    for row in range(0, len(x)):
        if (x.iloc[row,0] > 4 and x.iloc[row,0] < 9):
            list_new.append(x.iloc[row,1])
    list.append(sum(list_new))

results = gbz.apply(f)
The output result should be something like this:
a c
0 1 12
1 2 27
2 3 15
It might just be easiest to change the order of operations and filter against your criteria first; the filter does not depend on the grouping, so the result is the same.
z.query('4 < b < 9').groupby('a', as_index=False).c.sum()
which yields
a c
0 1 12
1 2 27
2 3 15
Use
In [2379]: z[z.b.between(4, 9, inclusive=False)].groupby('a', as_index=False).c.sum()
Out[2379]:
a c
0 1 12
1 2 27
2 3 15
Or
In [2384]: z[(4 < z.b) & (z.b < 9)].groupby('a', as_index=False).c.sum()
Out[2384]:
a c
0 1 12
1 2 27
2 3 15
You could also groupby first.
z = z.groupby('a').apply(lambda x: x.loc[x['b']\
.between(4, 9, inclusive=False), 'c'].sum()).reset_index(name='c')
z
a c
0 1 12
1 2 27
2 3 15
Or you can use
z.groupby('a').apply(lambda x : sum(x.loc[(x['b']>4)&(x['b']<9),'c']))\
.reset_index(name='c')
Out[775]:
a c
0 1 12
1 2 27
2 3 15

Pandas multiply dataframes with multiindex and overlapping index levels

I'm struggling with a task that should be simple, but it is not working as I thought it would. I have two numeric dataframes A and B with multiindex and columns below:
A = A B C D
X 1 AX1 BX1 CX1 DX1
2 AX2 BX2 CX2 DX2
3 AX3 BX3 CX3 DX3
Y 1 AY1 BY1 CY1 DY1
2 AY2 BY2 CY2 DY2
3 AY3 BY3 CY3 DY3
B = A B C D
X 1 a AX1a BX1a CX1a DX1a
b AX1b BX1b CX1b DX1b
c AX1c BX1c CX1c DX1c
2 a AX2a BX2a CX2a DX2a
b AX2b BX2b CX2b DX2b
c AX2c BX2c CX2c DX2c
3 a AX3a BX3a CX3a DX3a
b AX3b BX3b CX3b DX3b
c AX3c BX3c CX3c DX3c
Y 1 a AY1a BY1a CY1a DY1a
b AY1b BY1b CY1b DY1b
c AY1c BY1c CY1c DY1c
2 a AY2a BY2a CY2a DY2a
b AY2b BY2b CY2b DY2b
c AY2c BY2c CY2c DY2c
3 a AY3a BY3a CY3a DY3a
b AY3b BY3b CY3b DY3b
c AY3c BY3c CY3c DY3c
I'd like to multiply A * B, broadcasting over the innermost level of B, so that I get the resulting dataframe R below:
R= A B C D
X 1 a (AX1a * AX1) (BX1a * BX1) (CX1a * CX1) (DX1a * DX1)
b (AX1b * AX1) (BX1b * BX1) (CX1b * CX1) (DX1b * DX1)
c (AX1c * AX1) (BX1c * BX1) (CX1c * CX1) (DX1c * DX1)
2 a (AX2a * AX2) (BX2a * BX2) (CX2a * CX2) (DX2a * DX2)
b (AX2b * AX2) (BX2b * BX2) (CX2b * CX2) (DX2b * DX2)
c (AX2c * AX2) (BX2c * BX2) (CX2c * CX2) (DX2c * DX2)
3 a (AX3a * AX3) (BX3a * BX3) (CX3a * CX3) (DX3a * DX3)
b (AX3b * AX3) (BX3b * BX3) (CX3b * CX3) (DX3b * DX3)
c (AX3c * AX3) (BX3c * BX3) (CX3c * CX3) (DX3c * DX3)
Y 1 a (AY1a * AY1) (BY1a * BY1) (CY1a * CY1) (DY1a * DY1)
b (AY1b * AY1) (BY1b * BY1) (CY1b * CY1) (DY1b * DY1)
c (AY1c * AY1) (BY1c * BY1) (CY1c * CY1) (DY1c * DY1)
2 a (AY2a * AY2) (BY2a * BY2) (CY2a * CY2) (DY2a * DY2)
b (AY2b * AY2) (BY2b * BY2) (CY2b * CY2) (DY2b * DY2)
c (AY2c * AY2) (BY2c * BY2) (CY2c * CY2) (DY2c * DY2)
3 a (AY3a * AY3) (BY3a * BY3) (CY3a * CY3) (DY3a * DY3)
b (AY3b * AY3) (BY3b * BY3) (CY3b * CY3) (DY3b * DY3)
c (AY3c * AY3) (BY3c * BY3) (CY3c * CY3) (DY3c * DY3)
I tried using pandas multiply function with level keyword by doing:
b.multiply(a, level=[0,1])
but it throws an error: "TypeError: Join on level between two MultiIndex objects is ambiguous"
What is the right way of doing this operation?
I'd simply use DF.reindex on the smaller DF to match the index of the bigger DF, forward-filling the values present in it, and then do the multiplication.
B.multiply(A.reindex(B.index, method='ffill')) # Or method='pad'
Demo:
Prep up some data:
np.random.seed(42)
midx1 = pd.MultiIndex.from_product([['X', 'Y'], [1,2,3]])
midx2 = pd.MultiIndex.from_product([['X', 'Y'], [1,2,3], ['a','b','c']])
A = pd.DataFrame(np.random.randint(0,2,(6,4)), midx1, list('ABCD'))
B = pd.DataFrame(np.random.randint(2,4,(18,4)), midx2, list('ABCD'))
Small DF:
>>> A
A B C D
X 1 0 1 0 0
2 0 1 0 0
3 0 1 0 0
Y 1 0 0 1 0
2 1 1 1 0
3 1 0 1 1
Big DF:
>>> B
A B C D
X 1 a 3 3 3 3
b 3 3 2 2
c 3 3 3 2
2 a 3 2 2 2
b 2 2 3 3
c 3 3 3 2
3 a 3 3 2 3
b 2 3 2 3
c 3 2 2 2
Y 1 a 2 2 2 2
b 2 3 3 2
c 3 3 3 3
2 a 2 3 2 3
b 3 3 2 3
c 2 3 2 3
3 a 2 2 3 2
b 3 3 3 3
c 3 3 3 3
Multiplying them after making sure both share a common index axis across all levels:
>>> B.multiply(A.reindex(B.index, method='ffill'))
A B C D
X 1 a 0 3 0 0
b 0 3 0 0
c 0 3 0 0
2 a 0 2 0 0
b 0 2 0 0
c 0 3 0 0
3 a 0 3 0 0
b 0 3 0 0
c 0 2 0 0
Y 1 a 0 0 2 0
b 0 0 3 0
c 0 0 3 0
2 a 2 3 2 0
b 3 3 2 0
c 2 3 2 0
3 a 2 0 3 2
b 3 0 3 3
c 3 0 3 3
Now you can even supply the level parameter in DF.multiply for broadcasting to occur at those matching indices.
Proposed approach
We are talking about broadcasting, thus I would like to bring in NumPy supported broadcasting here.
The solution code would look something like this -
def numpy_broadcasting(df0, df1):
    m, n, r = map(len, df1.index.levels)
    a0 = df0.values.reshape(m, n, -1)
    a1 = df1.values.reshape(m, n, r, -1)
    out = (a1 * a0[..., None, :]).reshape(-1, a1.shape[-1])
    df_out = pd.DataFrame(out, index=df1.index, columns=df1.columns)
    return df_out
Basic idea:
1) Get views into the dataframes as multidimensional arrays. The dimensionality follows the level structure of the multiindex dataframes: the first dataframe has three levels (including the columns) and the second one has four. Thus we have a0 and a1 corresponding to the input dataframes df0 and df1, with 3 and 4 dimensions respectively.
2) Now comes the broadcasting part. We simply extend a0 to 4 dimensions by introducing a new axis at the third position. This new axis matches up against the third axis of df1, which allows us to perform element-wise multiplication.
3) Finally, to get the output multiindex dataframe, we simply reshape the product.
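As a rough shape check, here is a minimal sketch using the sizes from the sample run below (m=2, n=3, r=3 and 4 columns):

import numpy as np
a0 = np.zeros((2, 3, 4))        # df0.values.reshape(m, n, -1)
a1 = np.zeros((2, 3, 3, 4))     # df1.values.reshape(m, n, r, -1)
out = a1 * a0[..., None, :]     # a0 broadcasts along the new third axis
print(out.shape)                # (2, 3, 3, 4), then reshaped to (18, 4) for the result frame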
Sample run :
1) Input dataframes -
In [369]: df0
Out[369]:
A B C D
0 0 3 2 2 3
1 6 8 1 0
2 3 5 1 5
1 0 7 0 3 1
1 7 0 4 6
2 2 0 5 0
In [370]: df1
Out[370]:
A B C D
0 0 0 4 6 1 2
1 3 3 4 5
2 8 1 7 4
1 0 7 2 5 4
1 8 6 7 5
2 0 4 7 1
2 0 1 4 2 2
1 2 3 8 1
2 0 0 5 7
1 0 0 8 6 1 7
1 0 6 1 4
2 5 4 7 4
1 0 4 7 0 1
1 4 2 6 8
2 3 1 0 6
2 0 8 4 7 4
1 0 6 2 0
2 7 8 6 1
2) Output dataframe -
In [371]: df_out
Out[371]:
A B C D
0 0 0 12 12 2 6
1 9 6 8 15
2 24 2 14 12
1 0 42 16 5 0
1 48 48 7 0
2 0 32 7 0
2 0 3 20 2 10
1 6 15 8 5
2 0 0 5 35
1 0 0 56 0 3 7
1 0 0 3 4
2 35 0 21 4
1 0 28 0 0 6
1 28 0 24 48
2 21 0 0 36
2 0 16 0 35 0
1 0 0 10 0
2 14 0 30 0
Benchmarking
In [31]: # Setup input dataframes of the same shape as stated in the question
...: individuals = list(range(2))
...: time = (0, 1, 2)
...: index = pd.MultiIndex.from_tuples(list(product(individuals, time)))
...: A = pd.DataFrame(data={'A': np.random.randint(0,9,6), \
...: 'B': np.random.randint(0,9,6), \
...: 'C': np.random.randint(0,9,6), \
...: 'D': np.random.randint(0,9,6)
...: }, index=index)
...:
...:
...: individuals = list(range(2))
...: time = (0, 1, 2)
...: P = (0,1,2)
...: index = pd.MultiIndex.from_tuples(list(product(individuals, time, P)))
...: B = pd.DataFrame(data={'A': np.random.randint(0,9,18), \
...: 'B': np.random.randint(0,9,18), \
...: 'C': np.random.randint(0,9,18), \
...: 'D': np.random.randint(0,9,18)}, index=index)
...:
# @DSM's solution
In [32]: %timeit B * A.loc[B.index.droplevel(2)].set_index(B.index)
1 loops, best of 3: 8.75 ms per loop
# @Nickil Maveli's solution
In [33]: %timeit B.multiply(A.reindex(B.index, method='ffill'))
1000 loops, best of 3: 625 µs per loop
# @root's solution
In [34]: %timeit B * np.repeat(A.values, 3, axis=0)
1000 loops, best of 3: 487 µs per loop
In [35]: %timeit numpy_broadcasting(A, B)
1000 loops, best of 3: 191 µs per loop
Note that I am not claiming this is the right way to do this operation, only that it's one way to do it. I've had issues figuring out the right broadcast pattern in the past myself. :-/
The short version is that I wind up doing the broadcasting manually, and creating an appropriately-aligned intermediate object:
In [145]: R = B * A.loc[B.index.droplevel(2)].set_index(B.index)
In [146]: A.loc[("X", 2), "C"]
Out[146]: 0.5294149302910357
In [147]: A.loc[("X", 2), "C"] * B.loc[("X", 2, "c"), "C"]
Out[147]: 0.054262618238601339
In [148]: R.loc[("X", 2, "c"), "C"]
Out[148]: 0.054262618238601339
This works by indexing into A using the matching parts of B, and then setting the index to match. If I were more clever I'd be able to figure out a native way to get this to work but I haven't yet. :-(
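To make the intermediate object concrete, here is a small sketch against the demo A and B frames above (shapes only, purely for illustration):

# B.index.droplevel(2) repeats each outer key, e.g. ('X', 1), once per inner label,
# so A.loc[...] yields one row of A for every row of B; set_index then aligns them exactly.
aligned = A.loc[B.index.droplevel(2)].set_index(B.index)
assert aligned.shape == B.shape   # same shape and index as B, so a plain elementwise multiply works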

Python: how to find values in a dataframe without loop?

I have two dataframes net and M.
net =
i j d
0 5 3 3
1 2 0 2
2 3 2 1
3 4 5 2
4 0 1 3
5 0 3 4
M =
0 1 2 3 4 5
0 0 3 2 4 1 5
1 3 0 2 0 3 3
2 2 2 0 1 1 4
3 4 0 1 0 3 3
4 1 3 1 3 0 2
5 5 3 4 3 2 0
I want to find in M the same values of net['d'], choose randomly a cell in M and create a new dataframe containing the coordinate of that cell. For instance
net['d'][0] = 3
so in M I find:
M[0][1]
M[1][0]
M[1][4]
M[1][5]
...
Finally net1 would be something like that
net1 =
i1 j1 d1
0 1 5 3
1 5 4 2
2 2 3 1
3 1 2 2
4 1 5 3
5 3 0 4
This is what I am doing:
from numpy.random import randint

I1 = []
J1 = []
for i in net.index:
    tmp = net['d'][i]
    ds = np.where(M == tmp)
    size = len(ds[0])
    ind = randint(size)  # find a random location with distance ds
    h = ds[0][ind]
    w = ds[1][ind]
    I1.append(h)
    J1.append(w)

net1 = pd.DataFrame()
net1['i1'] = I1
net1['j1'] = J1
net1['d1'] = net['d']
I am wondering which is the best way to avoid that loop
You can stack the columns of M and then just sample it with replacement
net = pd.DataFrame({'i': [5,2,3,4,0,0],
                    'j': [3,0,2,5,1,3],
                    'd': [3,2,1,2,3,4]})

M = pd.DataFrame({0: [0,3,2,4,1,5],
                  1: [3,0,2,0,3,3],
                  2: [2,2,0,1,1,4],
                  3: [4,0,1,0,3,3],
                  4: [1,3,1,3,0,2],
                  5: [5,3,4,3,2,0]})
def random_net(net, M):
    # make long table, randomize order of rows and rename columns
    net1 = M.stack().reset_index()
    net1.columns = ['i1', 'j1', 'd1']
    # get size of each group for random mapping
    net1_id_length = net1.groupby('d1').size()
    # add id column to uniquely identify row in net
    net_copy = net.copy()
    # first map gets size of each group and second gets random integer
    net_copy['id'] = net_copy['d'].map(net1_id_length).map(np.random.randint)
    net1['id'] = net1.groupby('d1').cumcount()
    # make for easy lookup
    net_copy = net_copy.set_index(['d', 'id'])
    net1 = net1.set_index(['d1', 'id'])
    # choose from net1 only those from original net
    return net1.reindex(net_copy.index).reset_index('d').reset_index(drop=True).rename(columns={'d': 'd1'})
random_net(net, M)
output
d1 i1 j1
0 3 5 1
1 2 0 2
2 1 3 2
3 2 1 2
4 3 3 5
5 4 0 3
Timings on 6 million rows
n = 1000000
net = pd.DataFrame({'i': [5,2,3,4,0,0] * n,
                    'j': [3,0,2,5,1,3] * n,
                    'd': [3,2,1,2,3,4] * n})

M = pd.DataFrame({0: [0,3,2,4,1,5],
                  1: [3,0,2,0,3,3],
                  2: [2,2,0,1,1,4],
                  3: [4,0,1,0,3,3],
                  4: [1,3,1,3,0,2],
                  5: [5,3,4,3,2,0]})
%timeit random_net(net, M)
1 loop, best of 3: 13.7 s per loop

Identifying consecutive occurrences of a value in a column of a pandas DataFrame

I have a df like so:
Count
1
0
1
1
0
0
1
1
1
0
and I want to return a 1 in a new column if there are two or more consecutive occurrences of 1 in Count, and a 0 if there are not. So each row in the new column would get a 1 if this criterion is met in the column Count. My desired output would then be:
Count New_Value
1 0
0 0
1 1
1 1
0 0
0 0
1 1
1 1
1 1
0 0
I am thinking I may need to use itertools, but I have been reading about it and haven't come across what I need yet. I would also like to be able to use this method to count any number of consecutive occurrences, not just 2. For example, sometimes I need to count 10 consecutive occurrences; I just use 2 in the example here.
You could:
df['consecutive'] = df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count
to get:
Count consecutive
0 1 1
1 0 0
2 1 2
3 1 2
4 0 0
5 0 0
6 1 3
7 1 3
8 1 3
9 0 0
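The grouping key itself is worth a quick look; with the same df as above:

key = (df.Count != df.Count.shift()).cumsum()
print(key.tolist())   # [1, 2, 3, 3, 4, 4, 5, 5, 5, 6] -- each run of equal values gets its own id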
From here you can, for any threshold:
threshold = 2
df['consecutive'] = (df.consecutive > threshold).astype(int)
to get:
Count consecutive
0 1 0
1 0 0
2 1 1
3 1 1
4 0 0
5 0 0
6 1 1
7 1 1
8 1 1
9 0 0
or, in a single step:
(df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count >= threshold).astype(int)
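If you need to reuse this for different run lengths (the question mentions sometimes needing 10), a small wrapper along these lines may help (flag_runs is just an illustrative name):

def flag_runs(s, threshold=2):
    # size of the run each row belongs to, zeroed out for runs of 0
    run_sizes = s.groupby((s != s.shift()).cumsum()).transform('size') * s
    return (run_sizes >= threshold).astype(int)

df['New_Value'] = flag_runs(df.Count, threshold=2)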
In terms of efficiency, using pandas methods provides a significant speedup when the size of the problem grows:
df = pd.concat([df for _ in range(1000)])
%timeit (df.Count.groupby((df.Count != df.Count.shift()).cumsum()).transform('size') * df.Count >= threshold).astype(int)
1000 loops, best of 3: 1.47 ms per loop
compared to:
%%timeit
l = []
for k, g in groupby(df.Count):
    size = sum(1 for _ in g)
    if k == 1 and size >= 2:
        l = l + [1]*size
    else:
        l = l + [0]*size
pd.Series(l)
10 loops, best of 3: 76.7 ms per loop
Not sure if this is optimized, but you can give it a try:
from itertools import groupby
import pandas as pd

l = []
for k, g in groupby(df.Count):
    size = sum(1 for _ in g)
    if k == 1 and size >= 2:
        l = l + [1]*size
    else:
        l = l + [0]*size

df['new_Value'] = pd.Series(l)
df
Count new_Value
0 1 0
1 0 0
2 1 1
3 1 1
4 0 0
5 0 0
6 1 1
7 1 1
8 1 1
9 0 0

Best way to split a DataFrame given an edge

Suppose I have the following DataFrame:
a b
0 A 1.516733
1 A 0.035646
2 A -0.942834
3 B -0.157334
4 A 2.226809
5 A 0.768516
6 B -0.015162
7 A 0.710356
8 A 0.151429
And I need to group it given the "edge B"; that means the groups will be:
a b
0 A 1.516733
1 A 0.035646
2 A -0.942834
3 B -0.157334

4 A 2.226809
5 A 0.768516
6 B -0.015162

7 A 0.710356
8 A 0.151429
That is, any time I find a 'B' in column 'a', I want to split my DataFrame.
My current solution is:
# create the dataframe
s = pd.Series(['A','A','A','B','A','A','B','A','A'])
ss = pd.Series(np.random.randn(9))
dff = pd.DataFrame({"a": s, "b": ss})

# my solution
count = 0
ls = []
for i in s:
    if i == "A":
        ls.append(count)
    else:
        ls.append(count)
        count += 1
dff['grpb'] = ls
and I got the dataframe:
a b grpb
0 A 1.516733 0
1 A 0.035646 0
2 A -0.942834 0
3 B -0.157334 0
4 A 2.226809 1
5 A 0.768516 1
6 B -0.015162 1
7 A 0.710356 2
8 A 0.151429 2
Which I can then split with dff.groupby('grpb').
Is there a more efficient way to do this using pandas' functions?
here's a oneliner:
zip(*dff.groupby(pd.rolling_median((1*(dff['a']=='B')).cumsum(),3,True)))[-1]
[ 1 2
0 A 1.516733
1 A 0.035646
2 A -0.942834
3 B -0.157334,
1 2
4 A 2.226809
5 A 0.768516
6 B -0.015162,
1 2
7 A 0.710356
8 A 0.151429]
How about:
df.groupby((df.a == "B").shift(1).fillna(0).cumsum())
For example:
>>> df
a b
0 A -1.957118
1 A -0.906079
2 A -0.496355
3 B 0.552072
4 A -1.903361
5 A 1.436268
6 B 0.391087
7 A -0.907679
8 A 1.672897
>>> gg = list(df.groupby((df.a == "B").shift(1).fillna(0).cumsum()))
>>> pprint.pprint(gg)
[(0,
a b
0 A -1.957118
1 A -0.906079
2 A -0.496355
3 B 0.552072),
(1, a b
4 A -1.903361
5 A 1.436268
6 B 0.391087),
(2, a b
7 A -0.907679
8 A 1.672897)]
(I didn't bother getting rid of the indices; you could use [g for k, g in df.groupby(...)] if you liked.)
An alternative is:
In [36]: dff
Out[36]:
a b
0 A 0.689785
1 A -0.374623
2 A 0.517337
3 B 1.549259
4 A 0.576892
5 A -0.833309
6 B -0.209827
7 A -0.150917
8 A -1.296696
In [37]: dff['grpb'] = np.NaN
In [38]: breaks = dff[dff.a == 'B'].index
In [39]: dff['grpb'][breaks] = range(len(breaks))
In [40]: dff.fillna(method='bfill').fillna(len(breaks))
Out[40]:
a b grpb
0 A 0.689785 0
1 A -0.374623 0
2 A 0.517337 0
3 B 1.549259 0
4 A 0.576892 1
5 A -0.833309 1
6 B -0.209827 1
7 A -0.150917 2
8 A -1.296696 2
Or using itertools to create 'grpb' is an option too.
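For instance, a minimal itertools sketch (assuming the same dff as above) could build the same labels with accumulate:

from itertools import accumulate, chain

# group id for each row = number of 'B' rows seen strictly before it,
# which reproduces the grpb column from the question: [0, 0, 0, 0, 1, 1, 1, 2, 2]
edges = (dff['a'] == 'B').tolist()
dff['grpb'] = list(accumulate(chain([0], edges[:-1])))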
def vGroup(dataFrame, edgeCondition, groupName='autoGroup'):
    groupNum = 0
    dataFrame[groupName] = ''
    # loop over each row
    for inx, row in dataFrame.iterrows():
        if edgeCondition[inx]:
            dataFrame.ix[inx, groupName] = 'edge'
            groupNum += 1
        else:
            dataFrame.ix[inx, groupName] = groupNum
    return dataFrame[groupName]

vGroup(df, df[0] == ' ')
