I have a huge MultiIndex dataframe. I wish to create new columns based on part of the content of the MultiIndex. This is what I have:
import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'three', 'one', 'four', 'one', 'two', 'eight', 'one', 'two'],
          ['green', 'green', 'blue', 'blue', 'black', 'black', 'orange', 'green', 'blue', 'black']]
s = pd.DataFrame(np.random.randn(10), index=arrays)
s.index.names = ['p1', 'p2', 'p3']
s
0
p1 p2 p3
bar one green -0.676472
two green -0.030377
three blue -0.957517
baz one blue 0.710764
four black 0.404377
foo one black -0.286358
two orange -1.620832
eight green 0.316170
qux one blue -0.433310
two black 1.127754
This is what I want:
0 x1 x2 x3
p1 p2 p3
bar one green 1.563381 1 0 1
two green 0.193622 0 0 0
three blue 0.046728 1 0 0
baz one blue 0.098216 0 0 0
four black 1.826574 0 1 0
foo one black -0.120856 1 1 1
two orange 0.605020 0 0 0
eight green 0.693606 0 0 0
qux one blue 0.588244 1 1 1
two black -0.872104 1 1 1
Now, in pseudocode, I want to:
if (p1 =='bar') & (p2 == 'one') & (p3 == 'green'): s['x1'] = 1, s['x3'] = 1
if (p1 == 'bar') & (p3 == 'blue'): s['x1'] = 1
if (p1 == 'baz') & (p3 == 'black'): s['x2'] = 1
if (p1 =='foo') & (p2 == 'one') & (p3 == 'black'): s['x1'] = 1, s['x2'] = 1, s['x3'] = 1
if (p1 == 'qux'): s['x1'] = 1, s['x2'] = 1, s['x3'] = 1
That is, based on the values of the MultiIndex levels, I want to assign 1 to the new x-columns.
I am looking for a vectorized approach like numpy.select(condition, choice), but I can't get numpy.select to work with more than one choice per condition.
Since I have 14 index levels, I would prefer that the names of the levels I condition on appear explicitly (i.e. (p1 == 'bar') & (p2 == 'one') rather than positional values like ['bar', 'one']).
Any guidance would be much appreciated!
Thanks for the help!
One possibility is to use selection by index slices and set the columns to 1, like this:
idx = pd.IndexSlice
s = s.assign(x1=0, x2=0, x3=0)
s.loc[idx['bar','one','green'], ['x1','x3']] = 1
s.loc[idx['bar',:,'blue'], ['x1']] = 1
s.loc[idx['baz',:,'black'], ['x2']] = 1
s.loc[idx['foo','one','black'], ['x1','x2','x3']] = 1
s.loc[idx['qux',:,:], ['x1','x2','x3']] = 1
print (s)
0 x1 x2 x3
p1 p2 p3
bar one green 0.152556 1 0 1
two green 0.488762 0 0 0
three blue 0.037346 1 0 0
baz one blue 1.903518 0 0 0
four black 0.589922 0 1 0
foo one black 0.871984 1 1 1
two orange 0.514062 0 0 0
eight green -0.177246 0 0 0
qux one blue 0.740046 1 1 1
two black 0.755664 1 1 1
EDIT:
def get_i(lev, val):
    return s.index.get_level_values(lev) == val
s = s.assign(x1=0, x2=0, x3=0)
s.loc[get_i('p1','bar') & get_i('p2','one') & get_i('p3','green'), ['x1','x3']] = 1
s.loc[get_i('p1','bar') & get_i('p3','blue'), ['x1']] = 1
s.loc[get_i('p1','baz') & get_i('p3','black'), ['x2']] = 1
s.loc[get_i('p1','foo') & get_i('p2','one') & get_i('p3','black'), ['x1','x2','x3']] = 1
s.loc[get_i('p1','qux'), ['x1','x2','x3']] = 1
print (s)
0 x1 x2 x3
p1 p2 p3
bar one green -0.029773 1 0 1
two green -1.505461 0 0 0
three blue 1.819085 1 0 0
baz one blue 0.645498 0 0 0
four black -1.119554 0 1 0
foo one black 1.002072 1 1 1
two orange -0.461030 0 0 0
eight green -2.565080 0 0 0
qux one blue 0.286967 1 1 1
two black -0.522340 1 1 1
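If you prefer the level names to stay visible without a helper function, the same masks can also be built from Index.to_frame, which turns the index levels into ordinary columns. A small sketch under the same s as above; only the first two rules are shown:

lv = s.index.to_frame()  # columns p1, p2, p3, indexed like s
s = s.assign(x1=0, x2=0, x3=0)
s.loc[(lv['p1'] == 'bar') & (lv['p2'] == 'one') & (lv['p3'] == 'green'), ['x1', 'x3']] = 1
s.loc[(lv['p1'] == 'bar') & (lv['p3'] == 'blue'), ['x1']] = 1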
An extension to jezrael's solution: a combination of query and index slicing helps make explicit use of the index names:
#conditions
cond1 = s.query('p1=="bar" and p2=="one" and p3=="green"').index
cond2 = s.query('p1=="bar" and p3=="blue"').index
cond3 = s.query('p1=="baz" and p3=="black"').index
cond4 = s.query('p1=="foo" and p2=="one" and p3=="black"').index
cond5 = s.query('p1=="qux"').index
idx = pd.IndexSlice
#create zero columns
s = s.assign(x1=0,x2=0,x3=0)
#assign values:
s.loc[idx[cond1], ["x1","x3"]] = 1
s.loc[idx[cond2], ["x1"]] = 1
s.loc[idx[cond3], ['x2']] = 1
s.loc[idx[cond4], ['x1', 'x2','x3']] = 1
s.loc[idx[cond5], ['x1', 'x2','x3']] = 1
0 x1 x2 x3
p1 p2 p3
bar one green 1.122544 1 0 1
two green 0.157234 0 0 0
three blue 0.760863 1 0 0
baz one blue -0.194400 0 0 0
four black 0.937159 0 1 0
foo one black -0.986325 1 1 1
two orange -0.002486 0 0 0
eight green 0.067649 0 0 0
qux one blue 1.024345 1 1 1
two black 0.884644 1 1 1
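Since the question mentions 14 index levels, it may also be worth noting that all five rules can be driven from a single data structure instead of five hand-written statements. A sketch combining the two answers above (same s, same query strings):

rules = [
    ('p1=="bar" and p2=="one" and p3=="green"', ['x1', 'x3']),
    ('p1=="bar" and p3=="blue"', ['x1']),
    ('p1=="baz" and p3=="black"', ['x2']),
    ('p1=="foo" and p2=="one" and p3=="black"', ['x1', 'x2', 'x3']),
    ('p1=="qux"', ['x1', 'x2', 'x3']),
]
s = s.assign(x1=0, x2=0, x3=0)
for q, cols in rules:
    s.loc[s.query(q).index, cols] = 1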
Related
I am a bit lost here on how to find a simple solution in Python pandas.
I have a dataframe with 3 columns:
A B val
P1 P2 12
P1 P2 14
P2 P2 18
P2 P1 17
P1 P3 15
P1 P3 16
P1 P3 13
I want to count values, grouped by A and B, falling in specific intervals that are manually defined in another dataframe:
MIN MAX
12 12
13 15
16 17
The result should be the count per interval, with everything else collected in a rest column, as shown:
A B V_12_12 V_13_15 V_16_17 V_OTHERS
P1 P2 1 1 0 0
P2 P2 0 0 0 1
P2 P1 0 0 1 0
P1 P3 0 2 1 0
I want the result to be dynamic: if I change, remove, or add intervals, the column names and the number of columns in the final dataframe should adjust accordingly.
Thanks for the help.
Try something like this using pd.cut:
df = pd.read_clipboard()   # the A/B/val table from the question
df2 = pd.read_clipboard()  # the MIN/MAX intervals table
df['labels'] = pd.cut(df['val'],
                      bins=[0] + df2['MAX'].tolist() + [np.inf],
                      labels=[f'V_{lo}_{hi}' for lo, hi in zip(df2['MIN'], df2['MAX'])] + ['V_OTHERS'])
df.groupby(['A', 'B', 'labels'])['labels'].count().unstack().reset_index()
Output:
labels A B V_12_12 V_13_15 V_16_17 V_OTHERS
0 P1 P1 0 0 0 0
1 P1 P2 1 1 0 0
2 P1 P3 0 2 1 0
3 P2 P1 0 0 1 0
4 P2 P2 0 0 0 1
5 P2 P3 0 0 0 0
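For reference, a self-contained version of the same idea, since read_clipboard is hard to reproduce; note that building the bins from the MAX edges alone assumes the intervals are contiguous and sorted, as they are in the posted data:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['P1', 'P1', 'P2', 'P2', 'P1', 'P1', 'P1'],
                   'B': ['P2', 'P2', 'P2', 'P1', 'P3', 'P3', 'P3'],
                   'val': [12, 14, 18, 17, 15, 16, 13]})
df2 = pd.DataFrame({'MIN': [12, 13, 16], 'MAX': [12, 15, 17]})

df['labels'] = pd.cut(df['val'],
                      bins=[0] + df2['MAX'].tolist() + [np.inf],
                      labels=[f'V_{lo}_{hi}' for lo, hi in zip(df2['MIN'], df2['MAX'])] + ['V_OTHERS'])
df.groupby(['A', 'B', 'labels'])['labels'].count().unstack().reset_index()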
You can use a pd.IntervalIndex built from your MIN and MAX columns to cut the values before grouping:
import pandas as pd
# Your data here
df = pd.DataFrame({'A': {0: 'P1', 1: 'P1', 2: 'P2', 3: 'P2', 4: 'P1', 5: 'P1', 6: 'P1'},
                   'B': {0: 'P2', 1: 'P2', 2: 'P2', 3: 'P1', 4: 'P3', 5: 'P3', 6: 'P3'},
                   'val': {0: 12, 1: 14, 2: 18, 3: 17, 4: 15, 5: 16, 6: 13}})
intervals = pd.DataFrame({'MIN': {0: 12, 1: 13, 2: 16}, 'MAX': {0: 12, 1: 15, 2: 17}})

idx = pd.IntervalIndex.from_arrays(intervals["MIN"], intervals["MAX"], closed="both")
intervals = pd.cut(df["val"], idx)
groups = [df["A"], df["B"]]
renamer = lambda x: f"V_{x.left}_{x.right}" if isinstance(x, pd.Interval) else x
out = pd.concat([
    intervals.groupby(groups).value_counts().unstack(),   # handles all values within some interval
    intervals.isna().groupby(groups).agg(V_OTHERS="sum")  # handles the V_OTHERS column
], axis=1).rename(columns=renamer).reset_index()
out:
A B V_12_12 V_13_15 V_16_17 V_OTHERS
0 P1 P2 1 1 0 0
1 P1 P3 0 2 1 0
2 P2 P1 0 0 1 0
3 P2 P2 0 0 0 1
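It can help to look at what the cut itself returns before grouping: values covered by no interval come back as NaN, which is exactly what the isna() branch above turns into V_OTHERS. A minimal check with the same df and idx (output shown roughly):

print(pd.cut(df["val"], idx))
# 0    [12, 12]
# 1    [13, 15]
# 2         NaN    <- 18 falls in no interval, ends up in V_OTHERS
# 3    [16, 17]
# 4    [13, 15]
# 5    [16, 17]
# 6    [13, 15]
# Name: val, dtype: category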
pandas.IntervalIndex supports .loc lookups with plain values.
Assuming df1 is your first dataframe with A, B, val columns and df2 is your second dataframe with MIN and MAX columns.
Using pandas.IntervalIndex and pandas.crosstab:
# Construct an interval index from df2, which is your MIN/MAX dataframe
ii = pd.IntervalIndex.from_arrays(df2["MIN"], df2["MAX"], closed="both")
# Then look for df1[val] values which fall in between MIN, MAX from df2
m = df1["val"].ge(df2["MIN"].min()) & df1["val"].le(df2["MAX"].max())
# Those values you locate using IntervalIndex and format the interval
# you found as you wanted i.e V_{}_{}
df1.loc[m, "interval"] = [
f"V_{x.left}_{x.right}"
for x in pd.DataFrame(index=ii).loc[df1.loc[m, "val"]].index
]
# Others with 'V_OTHERS'
df1.loc[~m, "interval"] = "V_OTHERS"
# Then use crosstab to find the sum of occurrences
out = (
    pd.crosstab([df1["A"], df1["B"]], df1["interval"])
    .reset_index()
    .rename_axis("", axis=1)
)
print(out)
A B V_12_12 V_13_15 V_16_17 V_OTHERS
0 P1 P2 1 1 0 0
1 P1 P3 0 2 1 0
2 P2 P1 0 0 1 0
3 P2 P2 0 0 0 1
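The .loc-on-values behaviour is easy to verify in isolation, with ii as built above (a small sketch; output shown roughly):

probe = pd.DataFrame(index=ii)
print(probe.loc[[12, 14, 17]].index)
# IntervalIndex([[12, 12], [13, 15], [16, 17]], dtype='interval[int64, both]')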
Calling the second dataframe limits, the following NumPy approach works:
import numpy as np

diffs = np.subtract.outer(df["val"].to_numpy(),
                          limits.to_numpy()).reshape(len(df), -1)
from_min, from_max = diffs[:, ::2], diffs[:, 1::2]
counts = (pd.DataFrame((from_min >= 0) & (from_max <= 0))
          .groupby([df["A"], df["B"]], sort=False).sum())
counts.columns = limits.astype(str).agg("_".join, axis=1).radd("V_")
counts["V_OTHERS"] = df.groupby(["A", "B"])["val"].count().sub(counts.sum(axis=1), axis=0)
counts = counts.reset_index()
get the "cross" differences of "val" column values against the min & max limits each
that outer subtraction will give a shape "(len(df), *limits.shape)"
make it flattened in the last 2 dimensions to make it 2D to add as more columns
differentiate differences from_min and from_max
check if a value falls in between the ranges: greater than minimum, less than maximum
group these by "A" and "B" and sum those True/False's to count
pull out the names of the new columns from the contents of limits
row wise aggregation with "_" joining, and add from right "V_"
lastly compute the remainders
see however A & B pairs there are, and subtract the aforecomputed counts from them
and reset the index to move groupers to columns
to get
>>> counts
A B V_12_12 V_13_15 V_16_17 V_OTHERS
0 P1 P2 1 1 0 0
1 P2 P2 0 0 0 1
2 P2 P1 0 0 1 0
3 P1 P3 0 2 1 0
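To make the intermediate shapes concrete (a sketch, with df and limits as above):

import numpy as np
vals = df["val"].to_numpy()                        # shape (7,)
raw = np.subtract.outer(vals, limits.to_numpy())   # shape (7, 3, 2): one (MIN, MAX) pair per interval
flat = raw.reshape(len(df), -1)                    # shape (7, 6): even columns are val-MIN, odd are val-MAX
print(raw.shape, flat.shape)                       # (7, 3, 2) (7, 6)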
Try this:
def find_group(val):
    if 12 <= val <= 12:
        return "V_12_12"
    elif 13 <= val <= 15:
        return "V_13_15"
    elif 16 <= val <= 17:
        return "V_16_17"
    else:
        return "V_OTHERS"
df = pd.DataFrame({
    'A': ['P1', 'P1', 'P2', 'P2', 'P1', 'P1', 'P1'],
    'B': ['P2', 'P2', 'P2', 'P1', 'P3', 'P3', 'P3'],
    'val': [12, 14, 18, 17, 15, 16, 13]
})
df["group"] = df["val"].apply(find_group)
result = df.groupby(["A", "B", "group"]).count().unstack(fill_value=0).stack()
result.unstack()
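The thresholds above are hard-coded, which works against the "dynamic intervals" requirement; a sketch that builds the same lookup from the intervals frame instead (df2 with MIN/MAX columns as in the question):

def make_find_group(intervals):
    def find_group(val):
        for lo, hi in intervals[['MIN', 'MAX']].itertuples(index=False):
            if lo <= val <= hi:
                return f'V_{lo}_{hi}'
        return 'V_OTHERS'
    return find_group

df['group'] = df['val'].apply(make_find_group(df2))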
Thank you so much for answering this :)
I have a pandas data frame with one-hot encoded information about whether a customer has a certain product or not:
customer revenue P1 P2 P3
Customer1 $1 1 0 0
Customer2 $2 1 1 0
Customer3 $3 0 1 1
Customer4 $4 1 1 1
Customer5 $5 1 0 0
For Customer1 the revenue is $1 and it has product P1 only;
similarly, Customer4 has all products.
I want to transform this data to show all possible combinations of the products and the count of customers that have those combinations, like so
combinations Count of Customers Sum of revenue
P1 only 2 $6 ---> c1+c5
P2 only
P3 only
P1+P2+P3 1 $4 ---> c4
P1+P2 1 $2 ---> c2
P1+P3
P2+P3 1 $3 ---> c3
What's the best way to achieve this?
I believe the combinations of products can be generated like this:
import itertools

li = df.columns.to_list()
del li[0:2]  # drop 'customer' and 'revenue'
list(itertools.combinations(li, 2)) + list(itertools.combinations(li, 1)) + list(itertools.combinations(li, 3))
The key here is to revert the operation of get_dummies (or OneHotEncoder) to get a unique combination of products per customer to be able to group and aggregate:
# Generate labels P1, P2, P3, P1+P2, ...
from itertools import combinations
import numpy as np

cols = ['P1', 'P2', 'P3']
labels = ['+'.join(c) for i in range(len(cols)) for c in combinations(cols, i + 1)]
# Map each binary code to its label: P1=1, P2=2, P3=4 (powers of 2), so P1+P2=1+2=3
codes = {sum(2 ** cols.index(p) for p in l.split('+')): l for l in labels}

df1 = df.assign(R=df['revenue'].str.strip('$').astype(int),
                P=df[cols].mul(2 ** np.arange(len(cols))).sum(axis=1)) \
        .groupby('P').agg(**{'Sum of revenue': ('R', 'sum'),
                             'Count of customers': ('customer', 'nunique')}) \
        .assign(Product=lambda x: x.index.map(codes)) \
        .set_index('Product').reindex(labels, fill_value=0).reset_index()
print(df1)
# Output
Product Sum of revenue Count of customers
0 P1 6 2
1 P2 0 0
2 P3 0 0
3 P1+P2 2 1
4 P1+P3 0 0
5 P2+P3 3 1
6 P1+P2+P3 4 1
For each unique combination, you assign a label:
# P1=1, P2=2, P3=4 (power of 2) so P1+P3=1+4=5
>>> codes
{1: 'P1',
 2: 'P2',
 3: 'P1+P2',
 4: 'P3',
 5: 'P1+P3',
 6: 'P2+P3',
 7: 'P1+P2+P3'}
Detail about reverting the get_dummies operation:
>>> df[cols].mul(2 ** np.arange(len(cols))).sum(axis=1)
0 1 # P1 only
1 3 # P1+P2
2 6 # P2+P3
3 7 # P1+P2+P3
4 1 # P1 only
dtype: int64
Update
Can you guide me on how to return the names of the customers for each combination?
df1 = df.assign(R=df['revenue'].str.strip('$').astype(int),
                P=df[cols].mul(2 ** np.arange(len(cols))).sum(axis=1)) \
        .groupby('P').agg(**{'Sum of revenue': ('R', 'sum'),
                             'Count of customers': ('customer', 'nunique'),
                             'Customers': ('customer', ', '.join)}) \
        .assign(Product=lambda x: x.index.map(codes)) \
        .set_index('Product').reindex(labels, fill_value=0).reset_index() \
        .replace({'Customers': {0: ''}})
print(df1)
# Output
Product Sum of revenue Count of customers Customers
0 P1 6 2 Customer1, Customer5
1 P2 0 0
2 P3 0 0
3 P1+P2 2 1 Customer2
4 P1+P3 0 0
5 P2+P3 3 1 Customer3
6 P1+P2+P3 4 1 Customer4
Since you are already using one-hot encoding, let's try to solve this problem using binary masks:
# Extract the product columns
cols = df.columns[2:]
# Make a bitmask for each column.
# 1 = b001 -> P1
# 2 = b010 -> P2
# 4 = b100 -> P3
bitmask = np.power(2, np.arange(len(cols)))
# Generate the label for each possible combination of products
# 0 = b000 -> Nothing
# 1 = b001 -> P1
# 2 = b010 -> P2
# 3 = b011 -> P1 + P2
# ...
# 7 = b111 -> P1 + P2 + P3
labels = np.array([
    " + ".join(cols[np.bitwise_and(bitmask, i) != 0])
    for i in range(2 ** len(cols))
])
labels[0] = "Nothing"
# Assign each row in the df to a group based on the
# product combination (P1, P1 + P2, etc.)
# Using numpy broadcasting to multiply the product list
# with the bitmask:
# df[cols] * bitmask[None, :] -> groups
# P1 P2 P3
# 1 0 0 1 2 4 1*1 + 0*2 + 0*4 = 1
# 1 1 0 1 2 4 1*1 + 1*2 + 0*4 = 3
# 0 1 1 1 2 4 0*1 + 1*2 + 1*4 = 6
# 1 1 1 1 2 4 ... = 7
# 1 0 0 1 2 4 ... = 1
groups = (df[cols].to_numpy() * bitmask[None, :]).sum(axis=1)
# The result
df.groupby(labels[groups]).agg(**{
    "Count of Customers": ("customer", "count"),
    "Sum of Revenue": ("revenue", "sum")
}).reindex(labels, fill_value=0)
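One caveat if you run this against the frame exactly as posted: revenue there holds strings like '$1', so summing would concatenate text rather than add numbers. A hedged preprocessing step:

# Assumes revenue values look like "$1"; make them numeric before aggregating
df["revenue"] = df["revenue"].str.lstrip("$").astype(int)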
You can do something like this:
In [2692]: from itertools import product
In [2673]: x = df.pivot(['P1', 'P2', 'P3'], 'customer', 'revenue')
In [2661]: ix = pd.MultiIndex.from_tuples(list(product(set(range(2)),repeat = 3)), names=['P1', 'P2', 'P3'])
In [2688]: s = pd.DataFrame(index=ix)
In [2681]: output = s.join(x).sum(1).to_frame('Sum of revenue')
In [2686]: output['Count of Customers'] = s.join(x).count(1)
In [2687]: output
Out[2687]:
Sum of revenue Count of Customers
P1 P2 P3
0 0 0 0.0 0
1 0.0 0
1 0 0.0 0
1 3.0 1
1 0 0 6.0 2
1 0.0 0
1 0 2.0 1
1 4.0 1
# Group by the product flags only; 'size' counts customers per combination
# (assumes revenue is numeric)
my_df = df.groupby(df.columns.tolist()[2:]).agg({'customer': 'size', 'revenue': 'sum'})
my_df.columns = ['Count of Customers', 'Sum of revenue']
I need some help modifying my function, and with how to apply it, in order to iterate an if/else condition over multiple features.
Suppose we have the following table t1
import pandas as pd
names = {'name': ['Jon', 'Bill', 'Maria', 'Emma'],
         'feature1': [2, 3, 4, 5],
         'feature2': [1, 2, 3, 4],
         'feature3': [1, 2, 3, 4]}
t1 = pd.DataFrame(names, columns=['name', 'feature1', 'feature2', 'feature3'])
I want to create 3 new columns based on an ifelse condition. Here is how I am doing it for the first feature:
# Define the conditions
def ifelsefunction(row):
    if row['feature1'] >= 3:
        return 1
    elif row['feature1'] == 2:
        return 2
    else:
        return 0

# Apply the condition
t1['ft1'] = t1.apply(ifelsefunction, axis=1)
I would like to write the function into something iterable like this
def ifelsefunction(row, feature):
    if row[feature] >= 3:
        return 1
    elif row[feature] == 2:
        return 2
    else:
        return 0
t1['ft1_score'] = t1.apply(ifelsefunction(row, 'feature1'), axis=1)
t1['ft2_score'] = t1.apply(ifelsefunction(row, 'feature2'), axis=1)
t1['ft3_score'] = t1.apply(ifelsefunction(row, 'feature3'), axis=1)
---- EDIT ----
Thanks for the answers, I may have over-simplified the actual problem.
How do I do the same for these conditions?
def ifelsefunction(var1, var2):
    mask1 = (var1 >= 3) and (var1 < var2)
    mask2 = var1 == 2
    return np.select([mask1, mask2], [var1*0.7, var1*var2], default=0)
I think it is best to avoid loops here: use numpy.select to test and assign masks, applied only to the selected columns from a list. To pass the function the input DataFrame, DataFrame.pipe is used:
# Define the conditions
def ifelsefunction(df):
    m1 = df >= 3
    m2 = df == 2
    return np.select([m1, m2], [1, 2], default=0)
cols = ['feature1','feature2','feature3']
t1[cols] = t1[cols].pipe(ifelsefunction)
#alternative
#t1[cols] = ifelsefunction(t1[cols])
print (t1)
name feature1 feature2 feature3
0 Jon 2 0 0
1 Bill 1 2 2
2 Maria 1 1 1
3 Emma 1 1 1
For new columns use:
# Define the conditions
def ifelsefunction(df):
    m1 = df >= 3
    m2 = df == 2
    return np.select([m1, m2], [1, 2], default=0)
cols = ['feature1','feature2','feature3']
new = [f'{x}_score' for x in cols]
t1[new] = t1[cols].pipe(ifelsefunction)
#alternative
#t1[new] = ifelsefunction(t1[cols])
print (t1)
name feature1 feature2 feature3 feature1_score feature2_score \
0 Jon 2 1 1 2 0
1 Bill 3 2 2 1 2
2 Maria 4 3 3 1 1
3 Emma 5 4 4 1 1
feature3_score
0 0
1 2
2 1
3 1
EDIT:
You can change the function like this:
def ifelsefunction(df, var1, var2):
    mask1 = (df[var1] >= 3) & (df[var1] < df[var2])
    mask2 = df[var1] == 2
    return np.select([mask1, mask2], [df[var1]*0.7, df[var1]*df[var2]], default=0)
t1['new'] = ifelsefunction(t1, 'feature3','feature1')
print (t1)
name feature1 feature2 feature3 new
0 Jon 2 1 1 0.0
1 Bill 3 2 2 6.0
2 Maria 4 3 3 2.1
3 Emma 5 4 4 2.8
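If the same vectorised rule has to be evaluated for several column pairs, the calls can be looped; a sketch (the pairs chosen here are hypothetical):

pairs = [('feature3', 'feature1'), ('feature2', 'feature1')]
for v1, v2 in pairs:
    t1[f'{v1}_x_{v2}'] = ifelsefunction(t1, v1, v2)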
Try using apply
t1["ft1_score"] = t1.feature1.apply(lambda x: 1 if x >= 3 else (2 if x == 2 else 0))
t1["ft2_score"] = t1.feature2.apply(lambda x: 1 if x >= 3 else (2 if x == 2 else 0))
t1["ft3_score"] = t1.feature3.apply(lambda x: 1 if x >= 3 else (2 if x == 2 else 0))
For the following dataframe:
Name P1 P2 P3
A    1  0  2
B    1  1  1
C    0  5  6
I want to get 'Yes' where P1, P2, and P3 are all greater than 0.
Currently I am using either of the following methods:
Method1:
df['Check']= np.where((df['P1'] > 0) & (df['P2'] > 0) & (df['P3'] > 0),'Yes','No')
Method2:
df.loc[(df['P1'] > 0) & (df['P2'] > 0) & (df['P3'] > 0), 'Check'] = "Yes"
I have a large dataset with a lot of columns to which the conditions need to be applied.
Is there a shorter alternative to the multiple & conditions, so that I won't have to write the condition for each and every variable and can instead use a combined index range for the multiple columns?
I think you need DataFrame.all to check that all values per row are True:
cols = ['P1','P2','P3']
df['Check']= np.where((df[cols] > 0).all(axis=1),'Yes','No')
print (df)
Name P1 P2 P3 Check
0 A 1 0 2 No
1 B 1 1 1 Yes
2 C 0 5 6 No
print ((df[cols] > 0))
P1 P2 P3
0 True False True
1 True True True
2 False True True
print ((df[cols] > 0).all(axis=1))
0 False
1 True
2 False
dtype: bool
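On the "combined index range" part of the question: if the P-columns sit next to each other, or share a naming pattern, the list need not be written out by hand. A sketch (the positional slice and the regex are assumptions about the real data):

cols = df.columns[1:]  # everything after 'Name'
# or select by pattern: cols = df.filter(regex=r'^P\d+$').columns
df['Check'] = np.where((df[cols] > 0).all(axis=1), 'Yes', 'No')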
I want to plot factor plots for my data. I tried doing it the way shown below, but the chart values didn't turn out as expected.
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({'subset_product': ['A', 'A', 'A', 'B', 'B', 'C', 'C'],
                   'subset_close': [1, 1, 0, 1, 1, 1, 0]})
prod_counts = df.groupby('subset_product').size().rename('prod_counts')
df['prod_count'] = df['subset_product'].map(prod_counts)
g = sns.factorplot(y='prod_count', x='subset_product', hue='subset_close', data=df,
                   kind='bar', palette='muted', legend=False, ci=None)
plt.legend(loc='best')
However, my plots all have the same height, meaning it didn't separate the data into '1' and '0'.
Example: For A, the blue bar should have height = 1, and the green bar should have height = 2.
The problem is your 'prod_count'.
print(df)
# subset_close subset_product prod_count
# 0 1 A 3
# 1 1 A 3
# 2 0 A 3
# 3 1 B 2
# 4 1 B 2
# 5 1 C 2
# 6 0 C 2
You are telling seaborn that y is 3 when subset_close == 1 & subset_product == A and y is also 3 when subset_close == 0 & subset_product == A.
Below should do what you want.
# Count the number of each (`subset_close`, `subset_product`) combination.
df2 = df.groupby(['subset_product', 'subset_close']).size().reset_index(name='prod_count')
# Plot
g = sns.factorplot(y='prod_count', x='subset_product', hue='subset_close', data=df2,
kind='bar', palette='muted', legend=False, ci=None)
plt.legend(loc='best')
plt.show()
print(df2)
# subset_product subset_close prod_count
# 0 A 0 1
# 1 A 1 2
# 2 B 1 2
# 3 C 0 1
# 4 C 1 1
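A version note for newer environments: seaborn renamed factorplot to catplot in 0.9, and ci= gave way to errorbar= in 0.12, so on a current install the call would look roughly like:

g = sns.catplot(y='prod_count', x='subset_product', hue='subset_close', data=df2,
                kind='bar', palette='muted', legend=False, errorbar=None)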