I have a function defined as below:
def process_trans(chunk):
    grouped_object = chunk.groupby('msno', sort=False)  # not sorting results in a minor speedup
    func = {
        'late_count': ['sum'],
        'is_discount': ['count'],
        'is_not_discount': ['count'],
        'discount': ['sum'],
        'is_auto_renew': ['mean'],
        'is_cancel': ['mean'],
        'payment_type': ['??']}
    result = grouped_object.agg(func)
    return result
As you can see, I know that I can insert sum, count, or mean for each column. What kind of keyword can I insert to determine the payment_type that appears most frequently? Note that each type is represented by an integer.
I have seen people suggesting mode, but then index 0 is needed to pick out the most frequent item. Any better idea?
I believe you need value_counts and to select the first value of its index, because the function returns a sorted Series:
'payment_type' : lambda x: x.value_counts().index[0]
All together in a sample:
chunk = pd.DataFrame({'msno':list('aaaddd'),
'late_count':[4,5,4,5,5,4],
'is_discount':[7,8,9,4,2,3],
'is_not_discount':[4,5,4,5,5,4],
'discount':[7,8,9,4,2,3],
'is_auto_renew':[1,3,5,7,1,0],
'is_cancel':[5,3,6,9,2,4],
'payment_type':[1,0,0,1,1,0]})
print (chunk)
discount is_auto_renew is_cancel is_discount is_not_discount \
0 7 1 5 7 4
1 8 3 3 8 5
2 9 5 6 9 4
3 4 7 9 4 5
4 2 1 2 2 5
5 3 0 4 3 4
late_count msno payment_type
0 4 a 1
1 5 a 0
2 4 a 0
3 5 d 1
4 5 d 1
5 4 d 0
grouped_object=chunk.groupby('msno',sort=False)
func = {
'late_count':['sum'],
'is_discount':['count'],
'is_not_discount':['count'],
'discount':['sum'],
'is_auto_renew':['mean'],
'is_cancel':['mean'],
'payment_type' : [lambda x: x.value_counts().index[0]]}
result=grouped_object.agg(func)
print (result)
is_not_discount is_discount is_cancel discount late_count is_auto_renew \
count count mean sum sum mean
msno
a 3 3 4.666667 24 13 3.000000
d 3 3 5.000000 9 14 2.666667
payment_type
<lambda>
msno
a 0
d 1
You can make use of Series.mode, i.e.
func = {
'late_count':['sum'],
'is_discount':['count'],
'is_not_discount':['count'],
'discount':['sum'], 'is_auto_renew':['mean'], 'is_cancel':['mean'],
'payment_type': lambda x : x.mode()}
# Data from jezrael's answer above.
grouped_object.agg(func).rename(columns={'<lambda>': 'mode'})
Output:
is_not_discount is_auto_renew late_count payment_type discount \
count mean sum mode sum
msno
a 3 3.000000 13 0 24
d 3 2.666667 14 1 9
is_discount is_cancel
count mean
msno
a 3 4.666667
d 3 5.000000
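Note that Series.mode can return more than one value when frequencies tie; a minimal sketch, assuming you only want the first of the tied modes, would be:
func = {
'late_count':['sum'],
'is_discount':['count'],
'is_not_discount':['count'],
'discount':['sum'], 'is_auto_renew':['mean'], 'is_cancel':['mean'],
'payment_type': [lambda x: x.mode().iat[0]]}  # mode() returns its result sorted, so iat[0] picks the smallest tied mode

result = grouped_object.agg(func)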
Given a df
a b ngroup
0 1 3 0
1 1 4 0
2 1 1 0
3 3 7 2
4 4 4 2
5 1 1 4
6 2 2 4
7 1 1 4
8 6 6 5
I would like to compute the sum of multiple columns (i.e., a and b) grouped by the column ngroup.
In addition, I would like to count the number of elements in each group.
Based on these two conditions, the expected output is as below:
a b nrow_same_group ngroup
3 8 3 0
7 11 2 2
4 4 3 4
6 6 1 5
The following code does the work:
import pandas as pd
df = pd.DataFrame(list(zip([1,1,1,3,4,1,2,1,6],
                           [3,4,1,7,4,1,2,1,6],
                           [0,0,0,2,2,4,4,4,5])), columns=['a','b','ngroup'])
grouped_df = df.groupby(['ngroup'])
df1 = grouped_df[['a','b']].agg('sum').reset_index()
df2 = df['ngroup'].value_counts().reset_index()
df2.sort_values('index', axis=0, ascending=True, inplace=True, kind='quicksort', na_position='last')
df2.reset_index(drop=True, inplace=True)
df2.rename(columns={'index':'ngroup','ngroup':'nrow_same_group'},inplace=True)
df= pd.merge(df1, df2, on=['ngroup'])
However, I wonder whether there is a built-in pandas way to achieve something similar in a single line.
You can do it using only groupby + agg.
import pandas as pd
df = pd.DataFrame(list(zip([1,1,1,3,4,1,2,1,6],
                           [3,4,1,7,4,1,2,1,6],
                           [0,0,0,2,2,4,4,4,5])), columns=['a','b','ngroup'])
res = (
df.groupby('ngroup', as_index=False)
.agg(a=('a','sum'), b=('b', 'sum'),
nrow_same_group=('a', 'size'))
)
Here the parameters passed to agg are tuples whose first element is the column to aggregate and the second element is the aggregation function to apply to that column. The parameter names are the labels for the resulting columns.
Output:
>>> res
ngroup a b nrow_same_group
0 0 3 8 3
1 2 7 11 2
2 4 4 4 3
3 5 6 6 1
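For reference, the same named aggregation can also be spelled with pd.NamedAgg objects instead of plain tuples; a minimal equivalent sketch:
res = (
    df.groupby('ngroup', as_index=False)
      .agg(a=pd.NamedAgg(column='a', aggfunc='sum'),
           b=pd.NamedAgg(column='b', aggfunc='sum'),
           nrow_same_group=pd.NamedAgg(column='a', aggfunc='size'))
)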
First aggregate a and b with sum, then calculate the size of each group and assign it to the nrow_same_group column:
g = df.groupby('ngroup')
g.sum().assign(nrow_same_group=g.size())
a b nrow_same_group
ngroup
0 3 8 3
2 7 11 2
4 4 4 3
5 6 6 1
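If the frame ever contained extra columns you did not want summed, a small variant of the same idea restricts the sum explicitly (just an assumption about the real data; the sample only has a, b and ngroup):
g = df.groupby('ngroup')
# Sum only a and b, then attach the group sizes as a new column
g[['a', 'b']].sum().assign(nrow_same_group=g.size())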
I need help optimising my code, because my solution is very slow.
I have 2 dataframes. One has 6 columns: 1 column with main items and 5 columns with recommended items. The second df contains sales data per order (each product in a separate row).
I need to check which product is flagged as a "main" product, which products are recommended and which are just additional ones. If there is more than 1 main product in an order, I need to duplicate that order and set only 1 main product per duplicate.
I tried using pandas for that and found a working solution; however, I used itertuples, splitting both dfs by main items, etc. It gives the right result, but 1 order takes almost 2 seconds to compute and I have more than 1 million of them.
promo = pd.DataFrame({'main_id':[2,4,6],
'recommended_1':[1,2,8],
'recommended_2':[8,6,9],
'recommended_3':[10,9,10],
'recommended_4': [12,11,11],
'recommended_5': [6,7,8]})
orders = pd.DataFrame({
'order':['a','a','a','b','b','b','c','c'],
'product':[1,2,3,2,4,9,6,9]
})
promo['recommended_list'] = promo[
['recommended_1','recommended_2',
'recommended_3','recommended_4',
'recommended_5']].values.tolist()
flag = pd.DataFrame(
{'flag':orders['product'].isin(promo.main_id)}
)
flaged_orders = pd.concat([orders,flag], axis=1)
main_in_orders = pd.DataFrame(
flaged_orders.query("flag").groupby(['order'])['product']
.agg(lambda x: x.tolist())
)
order_holder = pd.DataFrame()
for index, row in main_in_orders.itertuples():
    for item in row:
        working_order = orders.query("order == @index")
        working_order.loc[working_order['product'] == item, 'kategoria'] = 'M'
        recommended_products = promo.loc[promo['main_id'] == item]['recommended_list'].iloc[0]
        working_order.loc[working_order['product'].isin(recommended_products), 'kategoria'] = 'R'
        working_order['main_id'] = item
        order_holder = pd.concat([order_holder, working_order])
# NaN values in this case would be "additional items"
print(order_holder)
So, can you help me with a faster alternative? Pointing me in some direction would be awesome, because I've been stuck on this for some time. Pandas is optional.
You can do two merges to get all the rows you want, then use np.select to create the column 'kategoria'. The first merge, with the inner method, keeps only the rows whose 'product' appears in 'main_id'; the second merge, with the left method, creates the duplicates when an 'order' has several 'main_id' values.
df_mainid = orders.merge(promo, left_on='product', right_on='main_id', how='inner')
print (df_mainid)
# order product main_id recommended_1 recommended_2 recommended_3 \
# 0 a 2 2 1 8 10
# 1 b 2 2 1 8 10
# 2 b 4 4 2 6 9
# 3 c 6 6 8 9 10
#
# recommended_4 recommended_5
# 0 12 6
# 1 12 6
# 2 11 7
# 3 11 8
So you get only the rows whose 'product' is a 'main_id'; then
df_merged = orders.merge(df_mainid.drop('product', axis=1), on=['order'], how='left')\
.sort_values(['order', 'main_id'])
print (df_merged)
# order product main_id recommended_1 recommended_2 recommended_3 \
# 0 a 1 2 1 8 10
# 1 a 2 2 1 8 10
# 2 a 3 2 1 8 10
# 3 b 2 2 1 8 10
# 5 b 4 2 1 8 10
# 7 b 9 2 1 8 10
# 4 b 2 4 2 6 9
# 6 b 4 4 2 6 9
# 8 b 9 4 2 6 9
# 9 c 6 6 8 9 10
# 10 c 9 6 8 9 10
# recommended_4 recommended_5
# 0 12 6
# 1 12 6
# 2 12 6
# 3 12 6
# 5 12 6
# 7 12 6
# 4 11 7
# 6 11 7
# 8 11 7
# 9 11 8
# 10 11 8
You get duplicated 'order' rows when there are several 'main_id' values. Finally, create the column 'kategoria' with np.select: the first condition gives 'M' where 'product' equals 'main_id'; the second gives 'R' where 'product' appears in any of the columns starting with 'recommended'. At the end, drop the 'recommended' columns to get the same output as your order_holder.
import numpy as np

conds = [df_merged['product'].eq(df_merged['main_id']),
         df_merged.filter(like='recommended').eq(df_merged['product'], axis=0).any(axis=1)]
choices = ['M', 'R']
df_merged['kategoria'] = np.select(conds, choices, np.nan)
df_merged = df_merged.drop(df_merged.filter(like='recommended').columns, axis=1)
print (df_merged)
order product main_id kategoria
0 a 1 2 R
1 a 2 2 M
2 a 3 2 nan
3 b 2 2 M
5 b 4 2 nan
7 b 9 2 nan
4 b 2 4 R
6 b 4 4 M
8 b 9 4 R
9 c 6 6 M
10 c 9 6 R
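One detail to be aware of: because the choices are strings, passing np.nan as the default makes np.select cast everything to strings, so the unmatched rows hold the literal text 'nan' (as in the output above). A minimal tweak, if you prefer real missing values, is to pass None as the default instead:
# Reuses conds/choices from above; unmatched rows become None (object dtype) instead of the string 'nan'
df_merged['kategoria'] = np.select(conds, choices, default=None)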
I have a dataframe like this:
source target weight
1 2 5
2 1 5
1 2 5
1 2 7
3 1 6
1 1 6
1 3 6
My goal is to remove the duplicate rows, where the order of the source and target columns does not matter: two rows with the same weight and the same pair of values in source/target (in either order) count as duplicates, and only one of them should be kept. In this case, the expected result would be
source target weight
1 2 5
1 2 7
3 1 6
1 1 6
Is there any way to do this without loops?
Use frozenset and duplicated
df[~df[['source', 'target']].apply(frozenset, 1).duplicated()]
source target weight
0 1 2 5
4 3 1 6
5 1 1 6
If you want to account for unordered source/target and weight
df[~df[['weight']].assign(A=df[['source', 'target']].apply(frozenset, 1)).duplicated()]
source target weight
0 1 2 5
3 1 2 7
4 3 1 6
5 1 1 6
However, to be more explicit, with more readable code:
# Create series where values are frozensets and therefore hashable.
# With hashable things, we can determine duplicity.
# Note that I also set the index and name to set up for a convenient `join`
s = pd.Series(list(map(frozenset, zip(df.source, df.target))), df.index, name='mixed')
# Use `drop` to focus on just those columns leaving whatever else is there.
# This is more general and accommodates more than just a `weight` column.
mask = df.drop(['source', 'target'], axis=1).join(s).duplicated()
df[~mask]
source target weight
0 1 2 5
3 1 2 7
4 3 1 6
5 1 1 6
Should be fairly easy.
data = [[1,2,5],
        [2,1,5],
        [1,2,5],
        [1,2,7],
        [3,1,6],
        [1,1,6],
        [1,3,6],
        ]
df = pd.DataFrame(data, columns=['source','target','weight'])
You can drop the exact duplicates using drop_duplicates (keep=False drops every occurrence of a duplicated row):
print(df.drop_duplicates(keep=False))
would result in:
source target weight
1 2 1 5
3 1 2 7
4 3 1 6
5 1 1 6
6 1 3 6
But because you also want to handle the unordered source/target issue, sort each pair first:
def pair(row):
    sorted_pair = sorted([row['source'], row['target']])
    row['source'] = sorted_pair[0]
    row['target'] = sorted_pair[1]
    return row

df = df.apply(pair, axis=1)
and then you can use df.drop_duplicates()
source target weight
0 1 2 5
3 1 2 7
4 1 3 6
5 1 1 6
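If the row-wise apply is too slow on a larger frame, the same pair-sorting idea can be vectorised with NumPy; a minimal sketch, assuming source and target are numeric:
import numpy as np

# Sort each (source, target) pair within its row, then drop exact duplicates
df[['source', 'target']] = np.sort(df[['source', 'target']].to_numpy(), axis=1)
df = df.drop_duplicates()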
With the example below:
df = pd.DataFrame({'signal':[1,0,0,1,0,0,0,0,1,0,0,1,0,0],'product':['A','A','A','A','A','A','A','B','B','B','B','B','B','B'],'price':[1,2,3,4,5,6,7,1,2,3,4,5,6,7],'price2':[1,2,1,2,1,2,1,2,1,2,1,2,1,2]})
I have a function "fill_price" to create a new column 'Price_B' based on 'signal' and 'price'. For every 'product' subgroup, Price_B equals to Price if 'signal' is 1. Price_B equals previous row's Price_B if signal is 0. If the subgroup starts with a 0 'signal', then 'price_B' will be kept at 0 until 'signal' turns 1.
Currently I have:
def fill_price(df, signal, price_A):
    p = df[price_A].where(df[signal] == 1)
    return p.ffill().fillna(0).astype(df[price_A].dtype)
this is then applied using:
df['Price_B'] = fill_price(df,'signal','price')
However, I want to use df.groupby('product').apply() to apply this fill_price function to the two 'product' subsets separately, and also apply it to both 'price' and 'price2' columns. Could someone help with that?
I basically want to do:
df.groupby('product', group_keys=False).apply(fill_price, 'signal', 'price2')
IIUC, you can use this syntax:
df['Price_B'] = df.groupby('product').apply(lambda x: fill_price(x,'signal','price2')).reset_index(level=0, drop=True)
Output:
price price2 product signal Price_B
0 1 1 A 1 1
1 2 2 A 0 1
2 3 1 A 0 1
3 4 2 A 1 2
4 5 1 A 0 2
5 6 2 A 0 2
6 7 1 A 0 2
7 1 2 B 0 0
8 2 1 B 1 1
9 3 2 B 0 1
10 4 1 B 0 1
11 5 2 B 1 2
12 6 1 B 0 2
13 7 2 B 0 2
You can write this much simpler without the extra function:
df['Price_B'] = (df.groupby('product',as_index=False)
.apply(lambda x: x['price2'].where(x.signal==1).ffill().fillna(0))
.reset_index(level=0, drop=True))
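Since the question also asks for both 'price' and 'price2', here is a minimal sketch applying the same per-group fill to each column (the output column names Price_B/Price2_B are just an assumption):
# group_keys=False keeps the original index, so each result aligns with df on assignment
for src, dst in [('price', 'Price_B'), ('price2', 'Price2_B')]:
    df[dst] = (df.groupby('product', group_keys=False)
                 .apply(lambda g, c=src: fill_price(g, 'signal', c)))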
I don't have much experience working with pandas. I have a pandas dataframe as shown below.
df = pd.DataFrame({ 'A' : [1,2,1],
'start' : [1,3,4],
'stop' : [3,4,8]})
I would like to create a new dataframe by iterating through the rows and appending to a resulting dataframe. For example, row 1 of the input dataframe has A=1, start=1 and stop=3, so it should generate the sequence [1, 2, 3], each paired with A=1:
A seq
1 1
1 2
1 3
2 3
2 4
1 4
1 5
1 6
1 7
1 8
So far, I've managed to identify what function to use to iterate through the rows of the pandas dataframe.
Here's one way with apply:
import numpy as np

(df.set_index('A')
   .apply(lambda x: pd.Series(np.arange(x['start'], x['stop'] + 1)), axis=1)
   .stack()
   .to_frame('seq')
   .reset_index(level=1, drop=True)
   .astype('int')
)
Out:
seq
A
1 1
1 2
1 3
2 3
2 4
1 4
1 5
1 6
1 7
1 8
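If you want A back as an ordinary column, as in the expected output, you can tack a final reset_index() onto the same chain; a minimal sketch:
import numpy as np

out = (df.set_index('A')
         .apply(lambda x: pd.Series(np.arange(x['start'], x['stop'] + 1)), axis=1)
         .stack()
         .to_frame('seq')
         .reset_index(level=1, drop=True)
         .astype('int')
         .reset_index())  # A becomes a regular column again
print(out)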
If you want to use loops:
In [1164]: data = []
In [1165]: for _, x in df.iterrows():
...: data += [[x.A, y] for y in range(x.start, x.stop+1)]
...:
In [1166]: pd.DataFrame(data, columns=['A', 'seq'])
Out[1166]:
A seq
0 1 1
1 1 2
2 1 3
3 2 3
4 2 4
5 1 4
6 1 5
7 1 6
8 1 7
9 1 8
To add to the answers above, here's a method that defines a function for expanding the dataframe input shown into the form the poster wants:
def gen_df_permutations(perm_def_df):
    m_list = []
    for i in perm_def_df.index:
        row = perm_def_df.loc[i]
        for n in range(row.start, row.stop+1):
            r_list = [row.A, n]
            m_list.append(r_list)
    return m_list
Call it, referencing the specification dataframe:
gen_df_permutations(df)
Or call it wrapped in a DataFrame constructor to return a final dataframe:
pd.DataFrame(gen_df_permutations(df),columns=['A','seq'])
A seq
0 1 1
1 1 2
2 1 3
3 2 3
4 2 4
5 1 4
6 1 5
7 1 6
8 1 7
9 1 8
N.B. the first column there is the dataframe index, which can be removed or ignored as requirements allow.