Count until condition is reached in Pandas - python

I need some input from you. The idea is that I would like to see how long (in rows) it takes before you can see
a new value in column SUB_B1, and
a new value in SUB_B2,
i.e. how many steps there are between
SUB_A1 and SUB_B1, and
between SUB_A2 and SUB_B2.
I have structured the data something like this (I sort the index in descending order by the result column; after that I separate index B and A and place them in new columns):
df.sort_values(['A','result'], ascending=[True,False]).set_index(['A','B'])
                 result  SUB_A1  SUB_A2  SUB_B1  SUB_B2
A      B
10_125 10_173  0.903257      10     125      10     173
       10_332  0.847333      10     125      10     332
       10_243  0.842802      10     125      10     243
       10_522  0.836335      10     125      10     522
       58_941  0.810760      10     125      58     941
...                 ...     ...     ...     ...     ...
10_173 10_125  0.903257      10     173      10     125
       58_941  0.847333      10     173      58     941
       1_941   0.842802      10     173       1     941
       96_512  0.836335      10     173      96     512
       10_513  0.810760      10     173      10     513
This is what I have done so far (edit: I think I need to iterate over values[]; however, I haven't managed to loop beyond the first rows yet...):
def func(group):
    if group.SUB_A1.values[0] == group.SUB_B1.values[0]:
        group.R1.values[0] = 1
    else:
        group.R1.values[0] = 0
    if group.SUB_A1.values[0] == group.SUB_B1.values[1] and group.R1.values[0] == 1:
        group.R1.values[1] = 2
    else:
        group.R1.values[1] = 0

df['R1'] = 0
df.groupby('A').apply(func)
Expected outcome:
                 result  SUB_B1  SUB_B2  R1  R2
A      B
10_125 10_173  0.903257      10     173   1   0
       10_332  0.847333      10     332   2   0
       10_243  0.842802      10     243   3   0
       10_522  0.836335      10     522   4   0
       58_941  0.810760      58     941   0   0
...                 ...     ...     ...  ..  ..

Are you looking for something like this?
Sample dataframe:
df = pd.DataFrame(
    {"SUB_A": [1, -1, -2, 3, 3, 4, 3, 6, 6, 6],
     "SUB_B": [1, 2, 3, 3, 3, 3, 4, 6, 6, 6]},
    index=pd.MultiIndex.from_product([range(1, 3), range(1, 6)], names=("A", "B"))
)
     SUB_A  SUB_B
A B
1 1      1      1
  2     -1      2
  3     -2      3
  4      3      3
  5      3      3
2 1      4      3
  2      3      4
  3      6      6
  4      6      6
  5      6      6
Now this
equal = df.SUB_A == df.SUB_B
df["R"] = equal.groupby(equal.groupby("A").diff().fillna(True).cumsum()).cumsum()
leads to
     SUB_A  SUB_B  R
A B
1 1      1      1  1
  2     -1      2  0
  3     -2      3  0
  4      3      3  1
  5      3      3  2
2 1      4      3  0
  2      3      4  0
  3      6      6  1
  4      6      6  2
  5      6      6  3
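In case the one-liner is hard to parse, here is my reading of it (the annotations are mine, not the answerer's):
equal = df.SUB_A == df.SUB_B
# within each A-group, diff() is truthy exactly where the match/no-match
# state flips; fillna(True) also marks each group's first row, so the run
# labels below never bleed across groups
runs = equal.groupby("A").diff().fillna(True).cumsum()
# a cumulative sum of booleans inside each run counts 1, 2, 3, ... while the
# rows keep matching, and stays at 0 throughout a non-matching run
df["R"] = equal.groupby(runs).cumsum()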

Try using pandas.DataFrame.iterrows and pandas.DataFrame.shift.
You can iterate through the dataframe and compare the current row with the previous one, then add a condition:
df['SUB_A2_last'] = df['SUB_A2'].shift()
count = 0
# Fill column with zeros
df['count_series'] = 0
for index, row in df.iterrows():
    subA = row['SUB_A2']            # column names are case-sensitive
    subA_last = row['SUB_A2_last']
    if subA == subA_last:
        count += 1
    else:
        count = 0
    df.loc[index, 'count_series'] = count
Then repeat for the B column. It is possible to get a better approach using pandas.DataFrame.apply and a custom function.
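For reference, that apply-based version might look something like this (a sketch under my own assumptions about the grouping column 'A'; untested against the asker's real data):
def consecutive_count(group):
    # True where SUB_A2 equals the previous row's SUB_A2 within the group
    same = group['SUB_A2'].eq(group['SUB_A2'].shift())
    # (~same).cumsum() labels each run of consecutive equal values;
    # counting within every run then gives 0 at each change and 1, 2, ... after
    return same.astype(int).groupby((~same).cumsum()).cumsum()

df['count_series'] = df.groupby('A', group_keys=False).apply(consecutive_count)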

Phew! Super! Thanks for the input, you guys. Here is what I ended up with:
def func(group):
    for each in range(len(group)):
        if group.SUB_A1.values[0] == group.SUB_B1.values[each]:
            group.R1.values[each] = each + 1
        else:
            group.R1.values[each] = 0
    return group

df['R1'] = 0
df = df.groupby('A').apply(func)  # assign the result back so R1 is kept

Related

How to remove consecutive pairs of opposite numbers from Pandas Dataframe?

How can I remove consecutive pairs of equal numbers with opposite signs from a Pandas dataframe?
Assuming I have this input dataframe:
incremental_changes = [2, -2, 2, 1, 4, 5, -5, 7, -6, 6]
df = pd.DataFrame({
    'idx': range(len(incremental_changes)),
    'incremental_changes': incremental_changes
})
idx incremental_changes
0 0 2
1 1 -2
2 2 2
3 3 1
4 4 4
5 5 5
6 6 -5
7 7 7
8 8 -6
9 9 6
I would like to get the following
idx incremental_changes
0 0 2
3 3 1
4 4 4
7 7 7
Note that the first 2 could either be idx 0 or 2, it doesn't really matter.
Thanks
You can group by runs of consecutive equal (absolute) numbers and transform:
import itertools

def remove_duplicates(s):
    '''Generates booleans that indicate when a pair of ints with
    opposite signs is found.'''
    iter_ = iter(s)
    # passing the same iterator twice walks the run two items at a time
    for (a, b) in itertools.zip_longest(iter_, iter_):
        if b is None:
            yield False
        else:
            yield a + b == 0
            yield a + b == 0
>>> mask = df.groupby(df['incremental_changes'].abs().diff().ne(0).cumsum()) \
['incremental_changes'] \
.transform(remove_duplicates)
Then
>>> df[~mask]
idx incremental_changes
2 2 2
3 3 1
4 4 4
7 7 7
Just do a rolling sum, then filter out the overlapping zero-sum pairs:
s = df.incremental_changes.rolling(2).sum()
s = s.mask(s[s==0].groupby(s.ne(0).cumsum()).cumcount()==1)==0
df[~(s | s.shift(-1))]
Out[640]:
idx incremental_changes
2 2 2
3 3 1
4 4 4
7 7 7
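For what it's worth, here is my step-by-step reading of the rolling version (annotations mine):
s = df.incremental_changes.rolling(2).sum()  # 0 wherever a value cancels its predecessor
# consecutive zeros overlap (2, -2, 2 cancels only once), so within each run
# the second zero is masked out before comparing against 0
s = s.mask(s[s == 0].groupby(s.ne(0).cumsum()).cumcount() == 1) == 0
df[~(s | s.shift(-1))]  # drop both members of every cancelling pair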

How do I create a while loop for this df that has moving average in every stage? [duplicate]

This question already has an answer here:
For loop that adds and deducts from pandas columns
(1 answer)
Closed 1 year ago.
So I want to spread the shipments per ID in the group one by one, looking at the average (BAL/SALES) to determine which store to give each unit to.
Here's my dataframe:
ID STOREID BAL SALES SHIP
1 STR1 50 5 18
1 STR2 6 7 18
1 STR3 74 4 18
2 STR1 35 3 500
2 STR2 5 4 500
2 STR3 54 7 500
While SHIP (grouped by ID) is greater than 0: calculate AVG (BAL/SALES), and for the row with the lowest AVG in each group add +1 to its BAL column and +1 to its Final column. Then repeat the process until SHIP is 0; the AVG is different at every stage, which is why I wanted a while loop.
A sample output of the first round is below. So do this until SHIP is 0 and the sum of Final per ID equals SHIP:
ID STOREID BAL SALES SHIP AVG Final
1 STR1 50 5 18 10 0
1 STR2 6 4 18 1.5 1
1 STR3 8 4 18 2 0
2 STR1 35 3 500 11.67 0
2 STR2 5 4 500 1.25 1
2 STR3 54 7 500 7.71 0
I've tried a couple of ways in SQL; I thought it would be better to do it in Python, but I haven't been doing a great job with my loop. Here's what I tried so far:
df['AVG'] = 0
df['FINAL'] = 0
for i in df.groupby(["ID"])['SHIP']:
    if i > 0:
        df['AVG'] = df['BAL'] / df['SALES']
        df['SHIP'] = df.groupby(["ID"])['SHIP'] - 1
        total = df.groupby(["ID"])["FINAL"].transform("cumsum")
        df['FINAL'] = + 1
        df['A'] = + 1
    else:
        df['FINAL'] = 0
This was challenging because more than one row in a group can have the same average calculation, which throws off the allocation.
This works on the example dataframe, if I understood you correctly:
d = {'ID': [1, 1, 1, 2, 2, 2],
     'STOREID': ['str1', 'str2', 'str3', 'str1', 'str2', 'str3'],
     'BAL': [50, 6, 74, 35, 5, 54],
     'SALES': [5, 7, 4, 3, 4, 7],
     'SHIP': [18, 18, 18, 500, 500, 500]}
df = pd.DataFrame(data=d)
df['AVG'] = 0
df['FINAL'] = 0

def calc_something(x):
    for i in range(x.iloc[0]['SHIP'])[0:500]:   # cap the rounds at 500
        x['AVG'] = x['BAL'] / x['SALES']
        x['SHIP'] = x['SHIP'] - 1
        x = x.sort_values('AVG').reset_index(drop=True)
        x.iloc[0, 2] = x['BAL'][0] + 1     # +1 BAL for the lowest-AVG store
        x.iloc[0, 6] = x['FINAL'][0] + 1   # +1 FINAL for the same store
    return x

df_final = df.groupby('ID').apply(calc_something).reset_index(drop=True).sort_values(['ID', 'STOREID'])
df_final
ID STOREID BAL SALES SHIP AVG FINAL
1 1 STR1 50 5 0 10.000 0
0 1 STR2 24 7 0 3.286 18
2 1 STR3 74 4 0 18.500 0
4 2 STR1 127 3 0 42.333 92
5 2 STR2 170 4 0 42.500 165
3 2 STR3 297 7 0 42.286 243
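A small design note from me: since each round only needs the row with the lowest ratio, idxmin can replace the per-round sort. A sketch under the same column assumptions (my code, not the answerer's):
def allocate(group):
    group = group.copy()
    group['FINAL'] = 0
    for _ in range(int(group['SHIP'].iloc[0])):
        avg = group['BAL'] / group['SALES']
        i = avg.idxmin()               # store with the lowest BAL/SALES this round
        group.loc[i, 'BAL'] += 1
        group.loc[i, 'FINAL'] += 1
    group['AVG'] = group['BAL'] / group['SALES']
    group['SHIP'] = 0                  # every unit has been handed out
    return group

df_final = df.groupby('ID').apply(allocate).reset_index(drop=True)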

How to select records randomly from the DataFrame?

I have the following single-column pandas DataFrame called y. The column is called 0 (zero).
y =
1
0
0
1
0
1
1
2
0
1
1
2
2
2
2
1
0
0
I want to select N row indices of records per y value. In the above example, there are 6 records of 0, 7 records of 1 and 5 records of 2.
I need to select 4 records from each of these 3 groups.
Below I provide my code. However, this code always selects the first N (e.g. 4) records per class. I need the selection to be done randomly across the whole dataset.
How can I do it?
idx0 = []
idx1 = []
idx2 = []
for i in range(0, len(y[0])):
    if y[0].iloc[i] == 0 and len(idx0) <= 4:
        idx0.append(i)
    if y[0].iloc[i] == 1 and len(idx1) <= 4:
        idx1.append(i)
    if y[0].iloc[i] == 2 and len(idx2) <= 4:
        idx2.append(i)
Update:
The expected outcome is a list of indices, not the filtered DataFrame y.
n = 4
a = y.groupby(0).apply(lambda x: x.sample(n)).reset_index(1).\
    rename(columns={'level_1': 'indices'}).reset_index(drop=True).groupby(0)['indices'].\
    apply(list).reset_index()
cls = 0                        # 'class' is a reserved word, so use another name
idx = a.iloc[2].tolist()[cls]
y.values[idx]  # THIS RETURNS WRONG CLASSES IN SOME CASES
0
1   # <- WRONG
0
0
Use groupby() with df.sample():
n=4
df.groupby('Y').apply(lambda x: x.sample(n)).reset_index(drop=True)
Y
0 0
1 0
2 0
3 0
4 1
5 1
6 1
7 1
8 2
9 2
10 2
11 2
EDIT, for index:
df.groupby('Y').apply(lambda x: x.sample(n)).reset_index(1).\
rename(columns={'level_1':'indices'}).reset_index(drop=True).groupby('Y')['indices'].\
apply(list).reset_index()
Y indices
0 0 [4, 1, 17, 16]
1 1 [0, 6, 10, 5]
2 2 [13, 14, 7, 11]
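One aside from me, not part of the original answer: sample accepts a random_state, so the draw can be made reproducible:
# same call as above, but reproducible across runs
df.groupby('Y').apply(lambda x: x.sample(n, random_state=42)).reset_index(drop=True)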
Using
idx0, idx1, idx2 = [np.random.choice(y.index.values, 4, replace=False).tolist() for _, y in df.groupby('0')]
idx0
Out[48]: [1, 2, 16, 8]
To explain in more detail:
s = pd.Series([1, 0, 1, 0, 2], index=[1, 3, 4, 5, 9])
idx = [1, 4]   # both anky's answer and mine return index labels
s.loc[idx]     # using .loc with index labels is correct
Out[59]:
1    1
4    1
dtype: int64
s.values[idx]  # slicing .values positionally with labels is wrong
Out[60]: array([0, 2], dtype=int64)
Supposing column "y" belongs to a dataframe "df" and you want to select N=4 random rows:
for i in np.unique(df.y).astype(int):
    print(df.y[np.random.choice(np.where(df.y == np.unique(df.y)[i])[0], 4)])
You will get:
10116 0
329 0
4709 0
5630 0
Name: y, dtype: int32
382 1
392 1
9124 1
383 1
Name: y, dtype: int32
221 2
443 2
4235 2
5322 2
Name: y, dtype: int32
Edited, to get index:
pd.concat([df.y[np.random.choice(np.where(df.y==np.unique(df.y)[i])[0],4)] for i in np.unique(df.y).astype(int)],axis=0)
You will get:
10116 0
329 0
4709 0
5630 0
382 1
392 1
9124 1
383 1
221 2
443 2
4235 2
5322 2
Name: y, dtype: int32
To get a nested list of indices:
[df.holiday[np.random.choice(np.where(df.holiday==np.unique(df.holiday)[i])[0],4)].index.tolist() for i in np.unique(df.holiday).astype(int)]
You will get:
[[10116,329,4709,5630],[382,392,9124,383],[221,443,4235,5322]]
N = 4
y.loc[y[0]==0].sample(N)
y.loc[y[0]==1].sample(N)
y.loc[y[0]==2].sample(N)
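If you are on pandas 1.1 or newer (an assumption; the method postdates this question), GroupBy.sample does the per-group draw in one call and returns label indices directly:
idx = y.groupby(0).sample(n=4).index.tolist()  # 12 labels, 4 per class
y.loc[idx]  # use .loc, since these are labels, not positions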

Groupby on condition and calculate sum of subgroups

Here is my data:
import numpy as np
import pandas as pd
z = pd.DataFrame({'a':[1,1,1,2,2,3,3],'b':[3,4,5,6,7,8,9], 'c':[10,11,12,13,14,15,16]})
z
a b c
0 1 3 10
1 1 4 11
2 1 5 12
3 2 6 13
4 2 7 14
5 3 8 15
6 3 9 16
Question:
How can I do a calculation on different elements of each subgroup? For example, for each group I want to extract every element in column 'c' whose corresponding element in column 'b' is between 4 and 9, and sum them all.
Here is the code I wrote (it runs, but I cannot get the correct result):
gbz = z.groupby('a')
# For displaying the groups:
gbz.apply(lambda x: print(x))

list = []
def f(x):
    list_new = []
    for row in range(0, len(x)):
        if (x.iloc[row, 0] > 4 and x.iloc[row, 0] < 9):
            list_new.append(x.iloc[row, 1])
    list.append(sum(list_new))

results = gbz.apply(f)
The output result should be something like this:
a c
0 1 12
1 2 27
2 3 15
It might just be easiest to change the order of operations and filter against your criteria first; they do not change after the groupby.
z.query('4 < b < 9').groupby('a', as_index=False).c.sum()
which yields
a c
0 1 12
1 2 27
2 3 15
Use
In [2379]: z[z.b.between(4, 9, inclusive=False)].groupby('a', as_index=False).c.sum()
Out[2379]:
a c
0 1 12
1 2 27
2 3 15
Or
In [2384]: z[(4 < z.b) & (z.b < 9)].groupby('a', as_index=False).c.sum()
Out[2384]:
a c
0 1 12
1 2 27
2 3 15
You could also groupby first.
z = z.groupby('a').apply(lambda x: x.loc[x['b']\
.between(4, 9, inclusive=False), 'c'].sum()).reset_index(name='c')
z
a c
0 1 12
1 2 27
2 3 15
Or you can use
z.groupby('a').apply(lambda x : sum(x.loc[(x['b']>4)&(x['b']<9),'c']))\
.reset_index(name='c')
Out[775]:
a c
0 1 12
1 2 27
2 3 15
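One caveat worth flagging (my addition): all of the filter-first variants silently drop any 'a' group that has no rows with b strictly between 4 and 9. If such groups should still appear with a sum of 0, reindex against the full set of keys:
out = (z.query('4 < b < 9')
        .groupby('a')['c'].sum()
        .reindex(z['a'].unique(), fill_value=0)  # restore groups filtered away entirely
        .reset_index())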

best way to implement Apriori in python pandas

What is the best way to implement the Apriori algorithm in pandas? So far I have got stuck on extracting the patterns using for loops; everything from the for loop onward does not work. Is there a vectorized way to do this in pandas?
import pandas as pd
import numpy as np

trans = pd.read_table('output.txt', header=None, index_col=0)

def apriori(trans, support=4):
    ts = pd.get_dummies(trans.unstack().dropna()).groupby(level=1).sum()
    # user input
    collen, rowlen = ts.shape
    # max length of items
    tssum = ts.sum(axis=1)
    maxlen = tssum.loc[tssum.idxmax()]
    items = list(ts.columns)
    results = []
    # loop through items
    for c in range(1, maxlen):
        # generate patterns
        pattern = []
        for n in len(pattern):
            # calculate support
            pattern = ['supp'] = pattern.sum / rowlen
            # filter by support level
            Condit = pattern['supp'] > support
            pattern = pattern[Condit]
            results.append(pattern)
    return results

results = apriori(trans)
print results
When I run this on the following input with support 3:
        a  b  c  d  e
0
11      1  1  1  0  0
666     1  0  0  1  1
10101   0  1  1  1  0
1010    1  1  1  1  0
414147  0  1  1  0  0
10101   1  1  0  1  0
1242    0  0  0  1  1
101     1  1  1  1  0
411     0  0  1  1  1
444     1  1  1  0  0
it should output something like
Pattern support
a 6
b 7
c 7
d 7
e 3
a,b 5
a,c 4
a,d 4
Assuming I understand what you're after, maybe
from itertools import combinations

def get_support(df):
    pp = []
    for cnum in range(1, len(df.columns) + 1):
        for cols in combinations(df, cnum):
            s = df[list(cols)].all(axis=1).sum()
            pp.append([",".join(cols), s])
    sdf = pd.DataFrame(pp, columns=["Pattern", "Support"])
    return sdf
would get you started:
>>> s = get_support(df)
>>> s[s.Support >= 3]
Pattern Support
0 a 6
1 b 7
2 c 7
3 d 7
4 e 3
5 a,b 5
6 a,c 4
7 a,d 4
9 b,c 6
10 b,d 4
12 c,d 4
14 d,e 3
15 a,b,c 4
16 a,b,d 3
21 b,c,d 3
[15 rows x 2 columns]
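A note on scale: get_support enumerates every column combination, which is exponential in the number of items. The actual Apriori trick is to prune level by level, only extending itemsets that themselves met the threshold. A rough sketch of that idea on the same 0/1 dataframe (my code; min_support is assumed to be an absolute count):
def apriori_prune(df, min_support=3):
    support = lambda cols: df[list(cols)].all(axis=1).sum()
    # level 1: frequent single items
    current = [(c,) for c in df.columns if support((c,)) >= min_support]
    results = {cols: support(cols) for cols in current}
    while current:
        # candidates: each frequent set extended by one further column
        candidates = {tuple(sorted(set(a) | {b}))
                      for a in current for b in df.columns if b not in a}
        current = [c for c in candidates if support(c) >= min_support]
        results.update({c: support(c) for c in current})
    return results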
Adding support, confidence, and lift calculations:
from itertools import combinations
import pandas as pd

def apriori(data, set_length=2):
    df_supports = []
    dataset_size = len(data)
    for combination_number in range(1, set_length + 1):
        for cols in combinations(data.columns, combination_number):
            supports = data[list(cols)].all(axis=1).sum() * 1.0 / dataset_size
            confidenceAB = data[list(cols)].all(axis=1).sum() * 1.0 / len(data[data[cols[0]] == 1])
            confidenceBA = data[list(cols)].all(axis=1).sum() * 1.0 / len(data[data[cols[-1]] == 1])
            liftAB = confidenceAB * dataset_size / len(data[data[cols[-1]] == 1])
            liftBA = confidenceAB * dataset_size / len(data[data[cols[0]] == 1])
            df_supports.append([",".join(cols), supports, confidenceAB, confidenceBA, liftAB, liftBA])
    df_supports = pd.DataFrame(df_supports, columns=['Pattern', 'Support', 'ConfidenceAB', 'ConfidenceBA', 'liftAB', 'liftBA'])
    df_supports = df_supports.sort_values(by='Support', ascending=False)  # sort_values returns a new frame
    return df_supports
