Count group by where value in interval - python

I am a bit lost here on how to find an easy solution in Python pandas.
I have a dataframe with 3 columns:
A B val
P1 P2 12
P1 P2 14
P2 P2 18
P2 P1 17
P1 P3 15
P1 P3 16
P1 P3 13
I want to count values grouped by A and B, binned into specific intervals that are manually defined in another dataframe:
MIN MAX
12 12
13 15
16 17
The result should be the count per interval, with the rest in a catch-all column, as shown:
A B V_12_12 V_13_15 V_16_17 V_OTHERS
P1 P2 1 1 0 0
P2 P2 0 0 0 1
P2 P1 0 0 1 0
P1 P3 0 2 1 0
I want the result to be dynamic: if I change, remove or add intervals, the column names and their number in the final dataframe should change accordingly.
Thanks for the help.

Try something like this using pd.cut (note it builds the bin edges from the MAX values only, so it assumes the intervals are contiguous):
import numpy as np
import pandas as pd

df = pd.read_clipboard()   # the A/B/val table
df2 = pd.read_clipboard()  # the MIN/MAX table

df['labels'] = pd.cut(df['val'],
                      bins=[0] + df2['MAX'].tolist() + [np.inf],
                      labels=[f'V_{s}_{e}' for s, e in zip(df2['MIN'], df2['MAX'])] + ['V_OTHERS'])
df.groupby(['A', 'B', 'labels'])['labels'].count().unstack().reset_index()
Output:
labels A B V_12_12 V_13_15 V_16_17 V_OTHERS
0 P1 P1 0 0 0 0
1 P1 P2 1 1 0 0
2 P1 P3 0 2 1 0
3 P2 P1 0 0 1 0
4 P2 P2 0 0 0 1
5 P2 P3 0 0 0 0

You can use a pd.IntervalIndex built from your MIN and MAX columns to cut the values before grouping:
import pandas as pd

# Your data here
df = pd.DataFrame({'A': {0: 'P1', 1: 'P1', 2: 'P2', 3: 'P2', 4: 'P1', 5: 'P1', 6: 'P1'},
                   'B': {0: 'P2', 1: 'P2', 2: 'P2', 3: 'P1', 4: 'P3', 5: 'P3', 6: 'P3'},
                   'val': {0: 12, 1: 14, 2: 18, 3: 17, 4: 15, 5: 16, 6: 13}})
intervals = pd.DataFrame({'MIN': {0: 12, 1: 13, 2: 16}, 'MAX': {0: 12, 1: 15, 2: 17}})

idx = pd.IntervalIndex.from_arrays(intervals["MIN"], intervals["MAX"], closed="both")
cuts = pd.cut(df["val"], idx)
groups = [df["A"], df["B"]]
renamer = lambda x: f"V_{x.left}_{x.right}" if isinstance(x, pd.Interval) else x

out = pd.concat([
    cuts.groupby(groups).value_counts().unstack(),   # handles all values that fall within some interval
    cuts.isna().groupby(groups).agg(V_OTHERS="sum")  # handles the V_OTHERS column
], axis=1).rename(columns=renamer).reset_index()
Output:
A B V_12_12 V_13_15 V_16_17 V_OTHERS
0 P1 P2 1 1 0 0
1 P1 P3 0 2 1 0
2 P2 P1 0 0 1 0
3 P2 P2 0 0 0 1

A pandas.IntervalIndex supports .loc lookups with plain values.
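For instance, a quick illustration using the MIN/MAX values from the question (a sketch; the empty frame exists only to carry the index):
import pandas as pd

ii = pd.IntervalIndex.from_arrays([12, 13, 16], [12, 15, 17], closed="both")
frame = pd.DataFrame(index=ii)
# .loc with a scalar returns the row(s) whose interval contains the value
frame.loc[14]        # hits the [13, 15] interval
frame.loc[[12, 16]]  # a list of values works too: [12, 12] and [16, 17]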
Assuming df1 is your first dataframe with A, B, val columns and df2 is your second dataframe with MIN and MAX columns.
Using pandas.IntervalIndex and pandas.crosstab:
# Construct an IntervalIndex from df2, your MIN/MAX dataframe
ii = pd.IntervalIndex.from_arrays(df2["MIN"], df2["MAX"], closed="both")

# Then look for df1['val'] values which fall between the overall MIN and MAX of df2
# (note: this assumes every value inside that overall range falls in some interval,
# i.e. there are no gaps between the intervals)
m = df1["val"].ge(df2["MIN"].min()) & df1["val"].le(df2["MAX"].max())

# Locate those values using the IntervalIndex and format the interval
# you found as you wanted, i.e. V_{}_{}
df1.loc[m, "interval"] = [
    f"V_{x.left}_{x.right}"
    for x in pd.DataFrame(index=ii).loc[df1.loc[m, "val"]].index
]

# Others get 'V_OTHERS'
df1.loc[~m, "interval"] = "V_OTHERS"

# Then use crosstab to count the occurrences
out = (
    pd.crosstab([df1["A"], df1["B"]], df1["interval"])
    .reset_index()
    .rename_axis("", axis=1)
)
print(out)
A B V_12_12 V_13_15 V_16_17 V_OTHERS
0 P1 P2 1 1 0 0
1 P1 P3 0 2 1 0
2 P2 P1 0 0 1 0
3 P2 P2 0 0 0 1

Calling the second dataframe limits:
import numpy as np
import pandas as pd

diffs = np.subtract.outer(df["val"].to_numpy(),
                          limits.to_numpy()).reshape(len(df), -1)
from_min, from_max = diffs[:, ::2], diffs[:, 1::2]
counts = (pd.DataFrame((from_min >= 0) & (from_max <= 0))
            .groupby([df["A"], df["B"]], sort=False).sum())
counts.columns = limits.astype(str).agg("_".join, axis=1).radd("V_")
counts["V_OTHERS"] = df.groupby(["A", "B"])["val"].count().sub(counts.sum(axis=1), axis=0)
counts = counts.reset_index()
get the "cross" differences of "val" column values against the min & max limits each
that outer subtraction will give a shape "(len(df), *limits.shape)"
make it flattened in the last 2 dimensions to make it 2D to add as more columns
differentiate differences from_min and from_max
check if a value falls in between the ranges: greater than minimum, less than maximum
group these by "A" and "B" and sum those True/False's to count
pull out the names of the new columns from the contents of limits
row wise aggregation with "_" joining, and add from right "V_"
lastly compute the remainders
see however A & B pairs there are, and subtract the aforecomputed counts from them
and reset the index to move groupers to columns
to get
>>> counts
A B V_12_12 V_13_15 V_16_17 V_OTHERS
0 P1 P2 1 1 0 0
1 P2 P2 0 0 0 1
2 P2 P1 0 0 1 0
3 P1 P3 0 2 1 0
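To make the shapes concrete, here is what the intermediate pieces look like on the sample data (7 rows, 3 intervals):
>>> limits.to_numpy()
array([[12, 12],
       [13, 15],
       [16, 17]])
>>> np.subtract.outer(df["val"].to_numpy(), limits.to_numpy()).shape
(7, 3, 2)  # len(df) x number of intervals x (min, max)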

Try this:
import pandas as pd

def find_group(val):
    if 12 <= val <= 12:
        return "V_12_12"
    elif 13 <= val <= 15:
        return "V_13_15"
    elif 16 <= val <= 17:
        return "V_16_17"
    else:
        return "V_OTHERS"

df = pd.DataFrame({
    'A': ['P1','P1','P2','P2','P1','P1','P1'],
    'B': ['P2','P2','P2','P1','P3','P3','P3'],
    'val': [12,14,18,17,15,16,13]
})

df["group"] = df["val"].apply(find_group)
result = df.groupby(["A","B","group"]).count().unstack(fill_value=0).stack()
result.unstack()
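Since the question asks for the intervals to stay dynamic, the same lookup can also be generated from the MIN/MAX dataframe instead of being hardcoded (a sketch, assuming that dataframe is called df2 as above):
def make_find_group(limits):
    # Build (lo, hi, label) triples once from the MIN/MAX dataframe
    bounds = [(lo, hi, f"V_{lo}_{hi}") for lo, hi in zip(limits["MIN"], limits["MAX"])]
    def find_group(val):
        for lo, hi, label in bounds:
            if lo <= val <= hi:
                return label
        return "V_OTHERS"
    return find_group

df["group"] = df["val"].apply(make_find_group(df2))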

pandas count of rows for every combination of columns for one hot encoded data

I have a pandas dataframe with one-hot encoded information about whether a customer has a certain product or not:
customer revenue P1 P2 P3
Customer1 $1 1 0 0
Customer2 $2 1 1 0
Customer3 $3 0 1 1
Customer4 $4 1 1 1
Customer5 $5 1 0 0
For Customer1 the revenue is $1 and it has product P1 only; similarly, Customer4 has all the products.
I want to transform this data to show all possible combinations of the products and the count of customers that have those combinations, like so
combinations Count of Customers Sum of revenue
P1 only 2 $6 ---> c1+c5
P2 only
P3 only
P1+P2+P3 1 $4 ---> c4
P1+P2 1 $2 ---> c2
P1+P3
P2+P3 1 $3 ---> c3
What's the best way to achieve this?
I believe the combinations of products can be generated like this:
import itertools

li = df.columns.to_list()
del li[0:2]
list(itertools.combinations(li, 2)) + list(itertools.combinations(li, 1)) + list(itertools.combinations(li, 3))
The key here is to revert the operation of get_dummies (or OneHotEncoder) to get a unique combination of products per customer, so you can group and aggregate:
# Generate labels P1, P2, P3, P1+P2, ...
import numpy as np
import pandas as pd
from itertools import combinations

cols = ['P1', 'P2', 'P3']
labels = ['+'.join(c) for i in range(len(cols)) for c in combinations(cols, i + 1)]
# Map each label to its binary value (P1=1, P2=2, P3=4), matching the encoding below
mapping = {sum(2 ** cols.index(p) for p in lbl.split('+')): lbl for lbl in labels}

df1 = df.assign(R=df['revenue'].str.strip('$').astype(int),
                P=df[cols].mul(2 ** np.arange(len(cols))).sum(axis=1)) \
        .groupby('P').agg(**{'Sum of revenue': ('R', 'sum'),
                             'Count of customers': ('customer', 'nunique')}) \
        .assign(Product=lambda x: x.index.map(mapping)) \
        .set_index('Product').reindex(labels, fill_value=0).reset_index()
print(df1)

# Output
    Product  Sum of revenue  Count of customers
0        P1               6                   2
1        P2               0                   0
2        P3               0                   0
3     P1+P2               2                   1
4     P1+P3               0                   0
5     P2+P3               3                   1
6  P1+P2+P3               4                   1
For each unique combination, you assign a label keyed by its binary value:
# P1=1, P2=2, P3=4 (powers of 2), so e.g. P1+P3=1+4=5
>>> mapping
{1: 'P1',
 2: 'P2',
 4: 'P3',
 3: 'P1+P2',
 5: 'P1+P3',
 6: 'P2+P3',
 7: 'P1+P2+P3'}
Detail about reverting the get_dummies operation:
>>> df[cols].mul(2 ** np.arange(len(cols))).sum(axis=1)
0    1  # P1 only
1    3  # P1+P2
2    6  # P2+P3
3    7  # P1+P2+P3
4    1  # P1 only
dtype: int64
Update
Can you guide me on how to return the names of the customers for each combination?
df1 = df.assign(R=df['revenue'].str.strip('$').astype(int),
                P=df[cols].mul(2 ** np.arange(len(cols))).sum(axis=1)) \
        .groupby('P').agg(**{'Sum of revenue': ('R', 'sum'),
                             'Count of customers': ('customer', 'nunique'),
                             'Customers': ('customer', ', '.join)}) \
        .assign(Product=lambda x: x.index.map(mapping)) \
        .set_index('Product').reindex(labels, fill_value=0).reset_index() \
        .replace({'Customers': {0: ''}})
print(df1)

# Output
    Product  Sum of revenue  Count of customers             Customers
0        P1               6                   2  Customer1, Customer5
1        P2               0                   0
2        P3               0                   0
3     P1+P2               2                   1             Customer2
4     P1+P3               0                   0
5     P2+P3               3                   1             Customer3
6  P1+P2+P3               4                   1             Customer4
Since you are already using one-hot encoding, let's try to solve this problem using binary masks:
# Extract the product columns
cols = df.columns[2:]

# Make a bitmask for each column:
# 1 = b001 -> P1
# 2 = b010 -> P2
# 4 = b100 -> P3
bitmask = np.power(2, np.arange(len(cols)))

# Generate the label for each possible combination of products:
# 0 = b000 -> Nothing
# 1 = b001 -> P1
# 2 = b010 -> P2
# 3 = b011 -> P1 + P2
# ...
# 7 = b111 -> P1 + P2 + P3
labels = np.array([
    " + ".join(cols[np.bitwise_and(bitmask, i) != 0])
    for i in range(2 ** len(cols))
])
labels[0] = "Nothing"

# Assign each row in the df to a group based on the
# product combination (P1, P1 + P2, etc.), using numpy
# broadcasting to multiply the product flags by the bitmask:
# df[cols] * bitmask[None, :] -> groups
# P1 P2 P3
#  1  0  0    1 2 4    1*1 + 0*2 + 0*4 = 1
#  1  1  0    1 2 4    1*1 + 1*2 + 0*4 = 3
#  0  1  1    1 2 4    0*1 + 1*2 + 1*4 = 6
#  1  1  1    1 2 4    ... = 7
#  1  0  0    1 2 4    ... = 1
groups = (df[cols].to_numpy() * bitmask[None, :]).sum(axis=1)

# The result (convert 'revenue' to a number first, e.g.
# df['revenue'].str.strip('$').astype(int), so the sum is numeric)
df.groupby(labels[groups]).agg(**{
    "Count of Customers": ("customer", "count"),
    "Sum of Revenue": ("revenue", "sum")
}).reindex(labels, fill_value=0)
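For reference, a quick check of the grouping step on the sample data (a sketch; revenue is kept numeric here for simplicity):
df = pd.DataFrame({
    'customer': ['Customer1', 'Customer2', 'Customer3', 'Customer4', 'Customer5'],
    'revenue': [1, 2, 3, 4, 5],
    'P1': [1, 1, 0, 1, 1],
    'P2': [0, 1, 1, 1, 0],
    'P3': [0, 0, 1, 1, 0],
})
# groups -> array([1, 3, 6, 7, 1])
# labels[groups] -> ['P1', 'P1 + P2', 'P2 + P3', 'P1 + P2 + P3', 'P1']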
You can do something like this:
In [2692]: from itertools import product
In [2673]: x = df.pivot(index=['P1', 'P2', 'P3'], columns='customer', values='revenue')
In [2661]: ix = pd.MultiIndex.from_tuples(list(product([0, 1], repeat=3)), names=['P1', 'P2', 'P3'])
In [2688]: s = pd.DataFrame(index=ix)
In [2681]: output = s.join(x).sum(1).to_frame('Sum of revenue')
In [2686]: output['Count of Customers'] = s.join(x).count(1)
In [2687]: output
Out[2687]:
Sum of revenue Count of Customers
P1 P2 P3
0 0 0 0.0 0
1 0.0 0
1 0 0.0 0
1 3.0 1
1 0 0 6.0 2
1 0.0 0
1 0 2.0 1
1 4.0 1
my_df = df.groupby(df.columns.tolist()[2:]).agg({'customer': 'size', 'revenue': 'sum'})
my_df.columns = ['Count of Customers', 'Sum of revenue']
Note this only lists the combinations that actually occur in the data; reindex against all possible combinations if you need the empty ones as well.

Find number of consecutively increasing/decreasing values in a pandas column (and fill another col with it) in an optimized way

I am trying to create a new column for a dataframe. The column I use for it is a price column. Basically, what I am trying to achieve is to get the number of times the price has increased/decreased consecutively. This needs to be rather quick because the dataframes can be quite big.
For example, the result should look like:
input = [1,2,3,2,1]
increase = [0,1,2,0,0]
decrease = [0,0,0,1,2]
You can compute the diff and apply a cumsum on the positive/negative values:
df = pd.DataFrame({'col': [1,2,3,2,1]})
s = df['col'].diff()
df['increase'] = s.gt(0).cumsum().where(s.gt(0), 0)
df['decrease'] = s.lt(0).cumsum().where(s.lt(0), 0)
Output:
col increase decrease
0 1 0 0
1 2 1 0
2 3 2 0
3 2 0 1
4 1 0 2
resetting the count
As your example is ambiguous, here is an additional method in case you want to reset the counts for each increasing/decreasing group, using groupby.
The resetting counts are labeled inc2/dec2:
df = pd.DataFrame({'col': [1,2,3,2,1,2,3,1]})
s = df['col'].diff()
s1 = s.gt(0)
s2 = s.lt(0)
df['inc'] = s1.cumsum().where(s1, 0)
df['dec'] = s2.cumsum().where(s2, 0)
si = df['inc'].eq(0)
sd = df['dec'].eq(0)
df['inc2'] = si.groupby(si.cumsum()).cumcount()
df['dec2'] = sd.groupby(sd.cumsum()).cumcount()
Output:
col inc dec inc2 dec2
0 1 0 0 0 0
1 2 1 0 1 0
2 3 2 0 2 0
3 2 0 1 0 1
4 1 0 2 0 2
5 2 3 0 1 0
6 3 4 0 2 0
7 1 0 3 0 1
data = {'input': [1, 2, 3, 2, 1]}
df = pd.DataFrame(data)

inc = df['input'] > df['input'].shift()
dec = df['input'] < df['input'].shift()
df['a'] = (inc.cumsum()
           - inc.astype(int).cumsum().where(~inc).ffill().fillna(0).astype(int))
df['b'] = (dec.cumsum()
           - dec.astype(int).cumsum().where(~dec).ffill().fillna(0).astype(int))
print(df)
output
input a b
0 1 0 0
1 2 1 0
2 3 2 0
3 2 0 1
4 1 0 2
Coding this manually using numpy might look like this:
import numpy as np

input = np.array([1, 2, 3, 2, 1])
increase = np.zeros(len(input), dtype=int)
decrease = np.zeros(len(input), dtype=int)
for i in range(1, len(input)):
    if input[i] > input[i-1]:
        increase[i] = increase[i-1] + 1
        decrease[i] = 0
    elif input[i] < input[i-1]:
        increase[i] = 0
        decrease[i] = decrease[i-1] + 1
    else:
        increase[i] = 0
        decrease[i] = 0

increase  # array([0, 1, 2, 0, 0])
decrease  # array([0, 0, 0, 1, 2])

Mapping from multiindex to multiple columns

I have a huge multiindex dataframe. I wish to create new columns based on part of the content of the multiindex. This is what I have:
import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'three', 'one', 'four', 'one', 'two', 'eight', 'one', 'two'],
          ['green', 'green', 'blue', 'blue', 'black', 'black', 'orange', 'green', 'blue', 'black']]
s = pd.DataFrame(np.random.randn(10), index=arrays)
s.index.names = ['p1', 'p2', 'p3']
s
0
p1 p2 p3
bar one green -0.676472
two green -0.030377
three blue -0.957517
baz one blue 0.710764
four black 0.404377
foo one black -0.286358
two orange -1.620832
eight green 0.316170
qux one blue -0.433310
two black 1.127754
This is what I want:
0 x1 x2 x3
p1 p2 p3
bar one green 1.563381 1 0 1
two green 0.193622 0 0 0
three blue 0.046728 1 0 0
baz one blue 0.098216 0 0 0
black 1.826574 0 1 0
foo one black -0.120856 1 1 1
two orange 0.605020 0 0 0
eight green 0.693606 0 0 0
qux one blue 0.588244 1 1 1
two black -0.872104 1 1 1
Now, in pseudocode, I want:
if (p1 =='bar') & (p2 == 'one') & (p3 == 'green'): s['x1'] = 1, s['x3'] = 1
if (p1 == 'bar') & (p3 == 'blue'): s['x1'] = 1
if (p1 == 'baz') & (p3 == 'black'): s['x2'] = 1
if (p1 =='foo') & (p2 == 'one') & (p3 == 'black'): s['x1'] = 1, s['x2'] = 1, s['x3'] = 1
if (p1 == 'qux'): s['x1'] = 1, s['x2'] = 1, s['x3'] = 1
I.e. based on the values of the columns of the multiindex, I want to assign 1 to the new x-columns.
I am looking for a vectorized approach like numpy.select(condlist, choicelist), but I can't get numpy.select to work with more than one choice per condition.
Since I have 14 index columns I would prefer if the name of the column I condition on is used explicitly (i.e. (p1 == 'bar') & (p2 == 'one') is preferred instead of ['bar','one',]).
Any guidance would be much appreciated!
Thanks for the help!
It is possible to use selection by index slices and set the columns to 1:
idx = pd.IndexSlice
s = s.assign(x1=0, x2=0, x3=0)
s.loc[idx['bar','one','green'], ['x1','x3']] = 1
s.loc[idx['bar',:,'blue'], ['x1']] = 1
s.loc[idx['baz',:,'black'], ['x2']] = 1
s.loc[idx['foo','one','black'], ['x1','x2','x3']] = 1
s.loc[idx['qux',:,:], ['x1','x2','x3']] = 1
print (s)
0 x1 x2 x3
p1 p2 p3
bar one green 0.152556 1 0 1
two green 0.488762 0 0 0
three blue 0.037346 1 0 0
baz one blue 1.903518 0 0 0
four black 0.589922 0 1 0
foo one black 0.871984 1 1 1
two orange 0.514062 0 0 0
eight green -0.177246 0 0 0
qux one blue 0.740046 1 1 1
two black 0.755664 1 1 1
EDIT:
def get_i(lev, val):
    return s.index.get_level_values(lev) == val
s = s.assign(x1=0, x2=0, x3=0)
s.loc[get_i('p1','bar') & get_i('p2','one') & get_i('p3','green'), ['x1','x3']] = 1
s.loc[get_i('p1','bar') & get_i('p3','blue'), ['x1']] = 1
s.loc[get_i('p1','baz') & get_i('p3','black'), ['x2']] = 1
s.loc[get_i('p1','foo') & get_i('p2','one') & get_i('p3','black'), ['x1','x2','x3']] = 1
s.loc[get_i('p1','qux'), ['x1','x2','x3']] = 1
print (s)
0 x1 x2 x3
p1 p2 p3
bar one green -0.029773 1 0 1
two green -1.505461 0 0 0
three blue 1.819085 1 0 0
baz one blue 0.645498 0 0 0
four black -1.119554 0 1 0
foo one black 1.002072 1 1 1
two orange -0.461030 0 0 0
eight green -2.565080 0 0 0
qux one blue 0.286967 1 1 1
two black -0.522340 1 1 1
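If there are many such rules, the same idea can be made data-driven (a sketch along the lines of the EDIT above; the rules list is a hypothetical helper, not part of the original answer):
lv = s.index.get_level_values
rules = [
    ((lv('p1') == 'bar') & (lv('p2') == 'one') & (lv('p3') == 'green'), ['x1', 'x3']),
    ((lv('p1') == 'bar') & (lv('p3') == 'blue'), ['x1']),
    ((lv('p1') == 'baz') & (lv('p3') == 'black'), ['x2']),
    ((lv('p1') == 'foo') & (lv('p2') == 'one') & (lv('p3') == 'black'), ['x1', 'x2', 'x3']),
    (lv('p1') == 'qux', ['x1', 'x2', 'x3']),
]
s = s.assign(x1=0, x2=0, x3=0)
for mask, cols in rules:
    s.loc[mask, cols] = 1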
An extension to @jezrael's solution: a combination of query and an index slice lets you use the index names explicitly:
# conditions
cond1 = s.query('p1=="bar" and p2=="one" and p3=="green"').index
cond2 = s.query('p1=="bar" and p3=="blue"').index
cond3 = s.query('p1=="baz" and p3=="black"').index
cond4 = s.query('p1=="foo" and p2=="one" and p3=="black"').index
cond5 = s.query('p1=="qux"').index
idx = pd.IndexSlice
# create zero columns
s = s.assign(x1=0, x2=0, x3=0)
# assign values
s.loc[idx[cond1], ['x1', 'x3']] = 1
s.loc[idx[cond2], ['x1']] = 1
s.loc[idx[cond3], ['x2']] = 1
s.loc[idx[cond4], ['x1', 'x2', 'x3']] = 1
s.loc[idx[cond5], ['x1', 'x2', 'x3']] = 1
0 x1 x2 x3
p1 p2 p3
bar one green 1.122544 1 0 1
two green 0.157234 0 0 0
three blue 0.760863 1 0 0
baz one blue -0.194400 0 0 0
four black 0.937159 0 1 0
foo one black -0.986325 1 1 1
two orange -0.002486 0 0 0
eight green 0.067649 0 0 0
qux one blue 1.024345 1 1 1
two black 0.884644 1 1 1

Assign cumulative values for flag for consecutive values in Pandas dataframe

I have the following dataframe:
x = pd.DataFrame({
    'User': ['U1','U1','U1','U1','U1','U2','U2','U2'],
    'Provider': ['P1','P1','P2','P1','P1','P1','P1','P2'],
    'Provider_key': [100,100,101,100,100,100,100,101],
    'Duration': [20,24,25,27,21,22,28,32]
})
This is what I want my dataframe to look like:
x = pd.DataFrame({
    'User': ['U1','U1','U1','U1','U1','U2','U2','U2'],
    'Provider': ['P1','P1','P2','P1','P1','P1','P1','P2'],
    'Provider_key': ['100','100','101','100','100','100','100','101'],
    'Duration': [20,24,25,27,21,22,28,32],
    'Flag': [1,1,0,2,2,1,1,0]
})
I have tried using this:
x['Provider_key'].groupby([
    x['User'],
    x['Provider'],
    x['Provider_key'].diff().ne(0).cumsum()
]).transform('size').ge(2).astype(int)
But this returns flag=1 even for later runs of the same values. How can I add a cumulative count to get the desired output?
I think you need Series.factorize per group, with the order swapped so the count runs from the bottom, applied only to groups with 2 or more values, plus numpy.where with a mask:
# identify consecutive runs of the same Provider_key
s = x['Provider_key'].diff().ne(0).cumsum()
# factorize users within each (Provider, run) group in reverse order,
# so the numbering counts from the bottom
s1 = x.iloc[::-1].groupby(['Provider', s])['User'].transform(lambda x: pd.factorize(x)[0]+1)
# only flag groups with 2 or more rows
m = x.groupby(['User', 'Provider', s])['Provider_key'].transform('size').ge(2)
x['new'] = np.where(m, s1, 0)
print(x)
User Provider Provider_key Duration new
0 U1 P1 100 20 1
1 U1 P1 100 24 1
2 U1 P2 101 25 0
3 U1 P1 100 27 2
4 U1 P1 100 21 2
5 U2 P1 100 22 1
6 U2 P1 100 28 1
7 U2 P2 101 32 0

Can we combine conditional statements using column indexes in Python?

For the following dataframe:
Name  P1  P2  P3
A      1   0   2
B      1   1   1
C      0   5   6
I want to get "Yes" where P1, P2 and P3 are all greater than 0.
Currently I am using either of the following methods:
Method 1:
df['Check'] = np.where((df['P1'] > 0) & (df['P2'] > 0) & (df['P3'] > 0), 'Yes', 'No')
Method 2:
df.loc[(df['P1'] > 0) & (df['P2'] > 0) & (df['P3'] > 0), 'Check'] = "Yes"
I have a large dataset with a lot of columns where the conditions are to be applied.
Is there a shorter alternative to the multiple & conditions, wherein I won't have to write the conditions for each and every variable and instead use a combined index range for the multiple columns?
I think you need DataFrame.all to check for all True values per row:
cols = ['P1','P2','P3']
df['Check'] = np.where((df[cols] > 0).all(axis=1), 'Yes', 'No')
print(df)
Name P1 P2 P3 Check
0 A 1 0 2 No
1 B 1 1 1 Yes
2 C 0 5 6 No
print(df[cols] > 0)
P1 P2 P3
0 True False True
1 True True True
2 False True True
print((df[cols] > 0).all(axis=1))
0 False
1 True
2 False
dtype: bool
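If the columns follow a naming pattern or sit in a contiguous position range, the column list can also be derived rather than typed out (a sketch; the layout is assumed from the question):
# select by name pattern: every column whose name starts with 'P'
cols = df.filter(regex='^P').columns
# ...or by position: all columns except the first
cols = df.columns[1:]
df['Check'] = np.where(df[cols].gt(0).all(axis=1), 'Yes', 'No')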
