Calculation between python pandas DataFrames, and storing the result in a dictionary - python

I'm trying some mapping and calculation between dataframes.
Are there any examples, or can anyone help with how to do this in Python?
I have 2 dataframes:
products    components
            c1  c2  c3  c4
p1           1   0   1   0
p2           0   1   1   0
p3           1   0   0   1
p4           0   1   0   1

            items cost
components  i1  i2  i3  i4
c1           0  10  30   0
c2          20  10   0   0
c3           0   0  10  15
c4          20   0   0  30
The end result should be a dictionary that, for each product, sums the item costs of each of its components and keeps the maximum:
{p1: [c1,c3] } -> {p1: [i2+i3,i3+i4] } -> {p1: [40,25] } -> {p1: 40 }
{p2: [c2,c3] } -> {p2: [i1+i2,i3+i4] } -> {p2: [30,25] } -> {p2: 30 }
{p3: [c1,c4] } -> {p3: [i2+i3,i1+i4] } -> {p3: [40,50] } -> {p3: 50 }
{p4: [c2,c4] } -> {p4: [i1+i2,i1+i4] } -> {p4: [30,50] } -> {p4: 50 }

Try (df1 is your first DataFrame, df2 the second):
print(df1.apply(lambda x: df2.loc[x[x.eq(1)].index].sum(axis=1).max(), axis=1))
Prints:
p1 40
p2 30
p3 50
p4 50
dtype: int64
To store the result to dictionary:
out = dict(
    df1.apply(lambda x: df2.loc[x[x.eq(1)].index].sum(axis=1).max(), axis=1)
)
print(out)
Prints:
{'p1': 40, 'p2': 30, 'p3': 50, 'p4': 50}
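As a side note, because each component's total item cost does not depend on the product, the same result can be reached without apply. A minimal vectorized sketch, assuming the same df1 and df2 as above:
comp_totals = df2.sum(axis=1)           # total item cost per component: c1=40, c2=30, c3=25, c4=50
masked = df1.mul(comp_totals, axis=1)   # zero out the components a product does not use
out = masked.max(axis=1).to_dict()      # {'p1': 40, 'p2': 30, 'p3': 50, 'p4': 50}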

You can use itertools.compress() on products data to find components required:
import itertools

prod_df["comps"] = prod_df.apply(lambda x: list(itertools.compress(x.index[1:], x.values[1:])), axis=1)
[Out]:
product c1 c2 c3 c4 comps
0 p1 1 0 1 0 [c1, c3]
1 p2 0 1 1 0 [c2, c3]
2 p3 1 0 0 1 [c1, c4]
3 p4 0 1 0 1 [c2, c4]
Then select the respective component rows from the components data, sum each row, and take the maximum:
prod_df["max_cost"] = prod_df.apply(lambda x: comp_df[comp_df["component"].isin(x["comps"])].iloc[:,1:].sum(axis=1).max(), axis=1)
[Out]:
product max_cost
0 p1 40
1 p2 30
2 p3 50
3 p4 50
Datasets used:
import pandas as pd

prod_data = [
    ("p1",1,0,1,0),
    ("p2",0,1,1,0),
    ("p3",1,0,0,1),
    ("p4",0,1,0,1),
]
prod_columns = ["product","c1","c2","c3","c4"]
prod_df = pd.DataFrame(data=prod_data, columns=prod_columns)
comp_data = [
    ("c1",0,10,30,0),
    ("c2",20,10,0,0),
    ("c3",0,0,10,15),
    ("c4",20,0,0,30),
]
comp_columns = ["component","i1","i2","i3","i4"]
comp_df = pd.DataFrame(data=comp_data, columns=comp_columns)

Related

Count group by where in interval

I am a bit lost here on how to find an easy solution to this in Python pandas.
I have a dataframe with 3 columns:
A B val
P1 P2 12
P1 P2 14
P2 P2 18
P2 P1 17
P1 P3 15
P1 P3 16
P1 P3 13
I want to count, grouped by A and B, the values falling in specific intervals that are manually defined in another dataframe:
MIN MAX
12 12
13 15
16 17
The result should be the count per interval, plus the rest, as presented:
A B V_12_12 V_13_15 V_16_17 V_OTHERS
P1 P2 1 1 0 0
P2 P2 0 0 0 1
P2 P1 0 0 1 0
P1 P3 0 2 1 0
I want the result to be dynamic: if I change the intervals, or remove or add others, the column names and their number in the final dataframe should change accordingly.
Thanks for the help.
Try something like this using pd.cut:
import numpy as np
import pandas as pd

df = pd.read_clipboard()   # the A / B / val data
df2 = pd.read_clipboard()  # the MIN / MAX intervals
df['labels'] = pd.cut(df['val'],
                      bins=[0] + df2['MAX'].tolist() + [np.inf],
                      labels=[f'V_{s}_{e}' for s, e in zip(df2['MIN'], df2['MAX'])] + ['V_OTHERS'])
df.groupby(['A', 'B', 'labels'])['labels'].count().unstack().reset_index()
Output:
labels A B V_12_12 V_13_15 V_16_17 V_OTHERS
0 P1 P1 0 0 0 0
1 P1 P2 1 1 0 0
2 P1 P3 0 2 1 0
3 P2 P1 0 0 1 0
4 P2 P2 0 0 0 1
5 P2 P3 0 0 0 0
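Note that building the bins from the MAX column only (plus 0 and np.inf) relies on the intervals being contiguous over the integer values used here; a value such as 12.5 would land in V_13_15 rather than V_OTHERS.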
You can use a pd.IntervalIndex built from your MIN and MAX columns to cut the values before grouping:
import pandas as pd
# Your data here
df = pd.DataFrame({'A': {0: 'P1', 1: 'P1', 2: 'P2', 3: 'P2', 4: 'P1', 5: 'P1', 6: 'P1'}, 'B': {0: 'P2', 1: 'P2', 2: 'P2', 3: 'P1', 4: 'P3', 5: 'P3', 6: 'P3'}, 'val': {0: 12, 1: 14, 2: 18, 3: 17, 4: 15, 5: 16, 6: 13}})
intervals = pd.DataFrame({'MIN': {0: 12, 1: 13, 2: 16}, 'MAX': {0: 12, 1: 15, 2: 17}})
idx = pd.IntervalIndex.from_arrays(intervals["MIN"], intervals["MAX"] , closed="both")
intervals = pd.cut(df["val"], idx)
groups = [df["A"], df["B"]]
renamer = lambda x: f"V_{x.left}_{x.right}" if isinstance(x, pd.Interval) else x
out = pd.concat([
intervals.groupby(groups).value_counts().unstack(), # This handles all values within some interval
intervals.isna().groupby(groups).agg(V_OTHERS="sum") # This handles the V_OTHERS column
], axis=1).rename(columns=renamer).reset_index()
out:
A B V_12_12 V_13_15 V_16_17 V_OTHERS
0 P1 P2 1 1 0 0
1 P1 P3 0 2 1 0
2 P2 P1 0 0 1 0
3 P2 P2 0 0 0 1
A pandas.IntervalIndex supports .loc lookups with plain values: each value is matched to the interval that contains it.
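For illustration, a minimal sketch of that lookup (toy values, not taken from the question):
import pandas as pd

ii = pd.IntervalIndex.from_arrays([12, 13, 16], [12, 15, 17], closed="both")
s = pd.Series(["V_12_12", "V_13_15", "V_16_17"], index=ii)
print(s.loc[14])   # 'V_13_15', because 14 falls inside [13, 15]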
Assuming df1 is your first dataframe with A, B, val columns and df2 is your second dataframe with MIN and MAX columns.
Using pandas.IntervalIndex and pandas.crosstab:
# Construct a interval index from df2 which is your MIN, MAX dataframe
ii = pd.IntervalIndex.from_arrays(df2["MIN"], df2["MAX"], closed="both")
# Then look for df1[val] values which fall in between MIN, MAX from df2
m = df1["val"].ge(df2["MIN"].min()) & df1["val"].le(df2["MAX"].max())
# Those values you locate using IntervalIndex and format the interval
# you found as you wanted i.e V_{}_{}
df1.loc[m, "interval"] = [
f"V_{x.left}_{x.right}"
for x in pd.DataFrame(index=ii).loc[df1.loc[m, "val"]].index
]
# Others with 'V_OTHERS'
df1.loc[~m, "interval"] = "V_OTHERS"
# Then use crosstab to find the sum of occurrences
out = (
pd.crosstab([df1["A"], df1["B"]], df1["interval"])
.reset_index()
.rename_axis("", axis=1)
)
print(out)
A B V_12_12 V_13_15 V_16_17 V_OTHERS
0 P1 P2 1 1 0 0
1 P1 P3 0 2 1 0
2 P2 P1 0 0 1 0
3 P2 P2 0 0 0 1
Referring to the second dataframe as limits below:
import numpy as np
import pandas as pd

diffs = np.subtract.outer(df["val"].to_numpy(),
                          limits.to_numpy()).reshape(len(df), -1)
from_min, from_max = diffs[:, ::2], diffs[:, 1::2]
counts = (pd.DataFrame((from_min >= 0) & (from_max <= 0))
          .groupby([df["A"], df["B"]], sort=False).sum())
counts.columns = limits.astype(str).agg("_".join, axis=1).radd("V_")
counts["V_OTHERS"] = df.groupby(["A", "B"]).count().sub(counts.sum(axis=1), axis=0)
counts = counts.reset_index()
Step by step:
- get the "cross" differences of the "val" column values against the min & max limits
- that outer subtraction gives a shape (len(df), *limits.shape)
- flatten the last 2 dimensions so the result is 2D, one column per limit
- split the differences into from_min and from_max
- check whether a value falls in between a range: greater than the minimum, less than the maximum
- group these by "A" and "B" and sum the True/False values to count them
- pull the names of the new columns from the contents of limits: row-wise aggregation joining with "_", then prepend "V_" from the right
- lastly compute the remainders: see how many rows each A & B pair has, and subtract the already-computed counts from that
- and reset the index to move the groupers back to columns

to get
>>> counts
A B V_12_12 V_13_15 V_16_17 V_OTHERS
0 P1 P2 1 1 0 0
1 P2 P2 0 0 0 1
2 P2 P1 0 0 1 0
3 P1 P3 0 2 1 0
Try this:
def find_group(val):
    if 12 <= val <= 12:
        return "V_12_12"
    elif 13 <= val <= 15:
        return "V_13_15"
    elif 16 <= val <= 17:
        return "V_16_17"
    else:
        return "V_OTHERS"
import pandas as pd

df = pd.DataFrame({
    'A': ['P1','P1','P2','P2','P1','P1','P1'],
    'B': ['P2','P2','P2','P1','P3','P3','P3'],
    'val': [12,14,18,17,15,16,13]
})
df["group"] = df["val"].apply(find_group)
result = df.groupby(["A","B","group"]).count().unstack(fill_value=0).stack()
result.unstack()

pandas count of rows for every combination of columns for one hot encoded data

Thank you so much for answering this :)
I have a pandas data frame with one-hot encoded information about whether a customer has a certain product or not:
customer revenue P1 P2 P3
Customer1 $1 1 0 0
Customer2 $2 1 1 0
Customer3 $3 0 1 1
Customer4 $4 1 1 1
Customer5 $5 1 0 0
For customer1 the revenue is $1 and it has product P1 only;
similarly, customer4 has all the products.
I want to transform this data to show all possible combinations of the products and the count of customers that have those combinations, like so
combinations Count of Customers Sum of revenue
P1 only 2 $6 ---> c1+c5
P2 only
P3 only
P1+P2+P3 1 $4 ---> c4
P1+P2 1 $2 ---> c2
P1+P3
P2+P3 1 $3 ---> c3
What's the best way to achieve this?
I believe the combinations of products can be generated like this:
import itertools
li=df.columns.to_list()
del(li[0:2])
list(itertools.combinations(li, 2))+list(itertools.combinations(li, 1))+list(itertools.combinations(li, 3))
The key here is to revert the operation of get_dummies (or OneHotEncoder) to get a unique combination of products per customer to be able to group and aggregate:
from itertools import combinations
import numpy as np
import pandas as pd

# Generate labels P1, P2, P3, P1+P2, ...
cols = ['P1', 'P2', 'P3']
labels = ['+'.join(c) for i in range(len(cols)) for c in combinations(cols, i+1)]
# Map each label to its bitmask code (P1=1, P2=2, P3=4, so P1+P2 -> 3 and P1+P3 -> 5)
codes = {sum(2 ** cols.index(p) for p in label.split('+')): label for label in labels}
df1 = df.assign(R=df['revenue'].str.strip('$').astype(int),
                P=df[cols].mul(2 ** np.arange(len(cols))).sum(axis=1)) \
        .groupby('P').agg(**{'Sum of revenue': ('R', 'sum'),
                             'Count of customers': ('customer', 'nunique')}) \
        .assign(Product=lambda x: x.index.map(codes)) \
        .set_index('Product').reindex(labels, fill_value=0).reset_index()
print(df1)
# Output
Product Sum of revenue Count of customers
0 P1 6 2
1 P2 0 0
2 P3 0 0
3 P1+P2 2 1
4 P1+P3 0 0
5 P2+P3 3 1
6 P1+P2+P3 4 1
For each unique combination, you assign a label:
# P1=1, P2=2, P3=4 (powers of 2), so P1+P2=1+2=3 and P1+P3=1+4=5
>>> codes
{1: 'P1',
2: 'P2',
4: 'P3',
3: 'P1+P2',
5: 'P1+P3',
6: 'P2+P3',
7: 'P1+P2+P3'}
Detail about reverting the get_dummies operation:
>>> df[cols].mul(2 ** np.arange(len(cols))).sum(axis=1)
0 1 # P1 only
1 3 # P1+P2
2 6 # P2+P3
3 7 # P1+P2+P3
4 1 # P1 only
dtype: int64
Update
Can you guide me with how to return the name of customers for each combination ?
df1 = df.assign(R=df['revenue'].str.strip('$').astype(int),
                P=df[cols].mul(2 ** np.arange(len(cols))).sum(axis=1)) \
        .groupby('P').agg(**{'Sum of revenue': ('R', 'sum'),
                             'Count of customers': ('customer', 'nunique'),
                             'Customers': ('customer', ', '.join)}) \
        .assign(Product=lambda x: x.index.map(codes)) \
        .set_index('Product').reindex(labels, fill_value=0).reset_index() \
        .replace({'Customers': {0: ''}})
print(df1)
# Output
Product Sum of revenue Count of customers Customers
0 P1 6 2 Customer1, Customer5
1 P2 0 0
2 P3 0 0
3 P1+P2 2 1 Customer2
4 P1+P3 0 0
5 P2+P3 3 1 Customer3
6 P1+P2+P3 4 1 Customer4
Since you are already using one-hot encoding, let's try to solve this problem using binary masks:
import numpy as np
import pandas as pd

# Extract the product columns
cols = df.columns[2:]
# Make a bitmask for each column.
# 1 = b001 -> P1
# 2 = b010 -> P2
# 4 = b100 -> P3
bitmask = np.power(2, np.arange(len(cols)))
# Generate the label for each possible combination of products
# 0 = b000 -> Nothing
# 1 = b001 -> P1
# 2 = b010 -> P2
# 3 = b011 -> P1 + P2
# ...
# 7 = b111 -> P1 + P2 + P3
labels = np.array([
" + ".join(cols[np.bitwise_and(bitmask, i) != 0])
for i in range(2 ** len(cols))
])
labels[0] = "Nothing"
# Assign each row in the df to a group based on the
# product combination (P1, P1 + P2, etc.)
# Using numpy broadcasting to multiply the product list
# with the bitmask:
# df[cols] * bitmask[None, :] -> groups
# P1 P2 P3
# 1 0 0 1 2 4 1*1 + 0*2 + 0*4 = 1
# 1 1 0 1 2 4 1*1 + 1*2 + 0*4 = 3
# 0 1 1 1 2 4 0*1 + 1*2 + 1*4 = 6
# 1 1 1 1 2 4 ... = 7
# 1 0 0 1 2 4 ... = 1
groups = (df[cols].to_numpy() * bitmask[None, :]).sum(axis=1)
# The result
df.groupby(labels[groups]).agg(**{
"Count of Customers": ("customer", "count"),
"Sum of Revenue": ("revenue", "sum")
}).reindex(labels, fill_value=0)
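Note that in the sample data revenue holds strings like "$1"; presumably it should be converted first, e.g. df["revenue"] = df["revenue"].str.strip("$").astype(int), so that the "sum" aggregation is numeric rather than string concatenation.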
You can do something like this:
In [2692]: from itertools import product
In [2673]: x = df.pivot(['P1', 'P2', 'P3'], 'customer', 'revenue')
In [2661]: ix = pd.MultiIndex.from_tuples(list(product(set(range(2)),repeat = 3)), names=['P1', 'P2', 'P3'])
In [2688]: s = pd.DataFrame(index=ix)
In [2681]: output = s.join(x).sum(1).to_frame('Sum of revenue')
In [2686]: output['Count of Customers'] = s.join(x).count(1)
In [2687]: output
Out[2687]:
Sum of revenue Count of Customers
P1 P2 P3
0 0 0 0.0 0
1 0.0 0
1 0 0.0 0
1 3.0 1
1 0 0 6.0 2
1 0.0 0
1 0 2.0 1
1 4.0 1
my_df = df.groupby(df.columns.tolist()[2:]).agg({'customer': 'size', 'revenue': 'sum'})
my_df.columns = ['Count of Customers', 'Sum of revenue']

Assign cumulative values for flag for consecutive values in Pandas dataframe

x = pd.DataFrame({
'User': ['U1','U1','U1','U1','U1','U2','U2','U2'],
'Provider': ['P1','P1','P2','P1','P1','P1','P1','P2'],
'Provider_key': [100,100,101,100,100,100,100,101],
'Duration': [20,24,25,27,21,22,28,32]
})
This is what I want my dataframe to look like:
x = pd.DataFrame({
'User': ['U1','U1','U1','U1','U1','U2','U2','U2'],
'Provider': ['P1','P1','P2','P1','P1','P1','P1','P2'],
'Provider_key': ['100','100','101','100','100','100','100','101'],
'Duration': [20,24,25,27,21,22,28,32],
'Flag': [1,1,0,2,2,1,1,0]
})
I have tried using this:
x['Provider_key'].groupby([
x['User'],
x['Provider'],
x['Provider_key'].diff().ne(0).cumsum()
]).transform('size').ge(2).astype(int)
But this returns flag=1 in case of same values. How can I add cumsum in this to get the desired output?
I think you need Series.factorize per group, with the order swapped so the counting runs from the bottom, and only for groups with 2 or more values - so numpy.where is added with a mask:
import numpy as np
import pandas as pd

s = x['Provider_key'].diff().ne(0).cumsum()
s1 = x.iloc[::-1].groupby(['Provider', s])['User'].transform(lambda x: pd.factorize(x)[0]+1)
m = x.groupby(['User','Provider', s])['Provider_key'].transform('size').ge(2)
x['new'] = np.where(m, s1, 0)
print (x)
User Provider Provider_key Duration new
0 U1 P1 100 20 1
1 U1 P1 100 24 1
2 U1 P2 101 25 0
3 U1 P1 100 27 2
4 U1 P1 100 21 2
5 U2 P1 100 22 1
6 U2 P1 100 28 1
7 U2 P2 101 32 0
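For what it's worth, the consecutive-run key s used above can be inspected on its own; a small sketch with the same x as in the question:
s = x['Provider_key'].diff().ne(0).cumsum()
# the counter increments every time Provider_key changes from the previous row,
# so rows in the same consecutive run share a key: [1, 1, 2, 3, 3, 3, 3, 4]
# (run 3 spans both U1 and U2 here, which is why User/Provider are also part of the groupby above)
print(s.tolist())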

Custom function + groupby Pandas with different conditions on grouped by variables

I want to generate some weights using groupby on a data that originally looks like this :
V1 V2 MONTH CHOICES PRIORITY
X T1 M1 C1 1
X T1 M1 C2 0
X T1 M1 C3 0
X T2 M1 C1 1
X T2 M1 C5 0
X T2 M1 C6 0
X T2 M1 C2 1
X T1 M2 C1 1
X T1 M2 C2 0
X T1 M2 C3 0
X T2 M2 C1 0
X T2 M2 C5 1
X T2 M2 C6 0
X T2 M2 C2 1
Basically, when the MONTH is different from M1, I want the flagged choices to have weights equal to double those of any non-flagged choice.
Example: if you have (C1, C2, C3) and C1 is the only one flagged, the weights would be 0.5 / 0.25 / 0.25.
At the same time, for the first month, I want the weights to be focused solely on the flagged choices. The previous example would become (1 / 0 / 0).
A note about the data:
For a given tuple (V1,V2,MONTH), we can have at most two choices flagged as priorities (no priority at all is a possibility).
Here's what I've tried :
def weights_preferences(data):
    if (data.MONTH.values != 'M1'):
        data['WEIGHTS'] = 1/(len(data)+data[data.PRIORITY==1].shape[0])
        data['WEIGHTS'] = data.apply(lambda x : 2*x.WEIGHTS if x.PRIORITY==1 else x.WEIGHTS, axis=1)
    elif data.MONTH.values == 'M1' & data[data.PRIORITY==1].shape[0]==0 :
        data['WEIGHTS'] = 1/(len(data))
    else :
        if data[data.PREFERENCE==1].shape[0]==1 :
            data['WEIGHTS'] = [1 if x[1].PRIORITY==1 else 0 for x in data.iterrows()]
        else :
            data['WEIGHTS'] = [0.5 if x[1].PRIORITY==1 else 0 for x in data.iterrows()]
    return data
tmp = tmp.groupby(['V1','V2','MONTH']).apply(weights_preferences)
The problem is that since I group by 'MONTH', it seems that that column no longer appears in the data on which weights_preferences is applied.
P.S : Output would look like this
V1 V2 MONTH CHOICES PRIORITY WEIGHTS
X T1 M1 C1 1 1
X T1 M1 C2 0 0
X T1 M1 C3 0 0
X T2 M1 C1 1 0.5
X T2 M1 C5 0 0
X T2 M1 C6 0 0
X T2 M1 C2 1 0.5
X T1 M2 C1 1 0.5
X T1 M2 C2 0 0.25
X T1 M2 C3 0 0.25
X T2 M2 C1 0 0.16
X T2 M2 C5 1 0.33
X T2 M2 C6 0 0.16
X T2 M2 C2 1 0.33
Any suggestions are very welcome!
Thanks.
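Not a full answer, but a minimal sketch of one way to produce such weights, iterating over the groups so the MONTH key stays available; the rule below is my reading of the weighting described above, so treat it as an assumption:
import numpy as np
import pandas as pd

def group_weights(g, month):
    n, n_prio = len(g), int(g['PRIORITY'].sum())
    if month == 'M1':
        if n_prio == 0:
            return pd.Series(1.0 / n, index=g.index)   # no priority at all: uniform weights
        return g['PRIORITY'] / n_prio                  # all weight on the flagged choices
    base = 1.0 / (n + n_prio)                          # other months: flagged choices weigh twice as much
    return pd.Series(np.where(g['PRIORITY'].eq(1), 2 * base, base), index=g.index)

parts = [group_weights(g, month) for (v1, v2, month), g in tmp.groupby(['V1', 'V2', 'MONTH'], sort=False)]
tmp['WEIGHTS'] = pd.concat(parts)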

conditional sums for pandas aggregate

I just recently made the switch from R to Python and have been having some trouble getting used to data frames again, as opposed to using R's data.table. The problem I've been having is that I'd like to take a list of strings, check for a value, then sum the count of that string, broken down by user. So I would like to take this data:
A_id B C
1: a1 "up" 100
2: a2 "down" 102
3: a3 "up" 100
3: a3 "up" 250
4: a4 "left" 100
5: a5 "right" 102
And return:
A_id_grouped sum_up sum_down ... over_200_up
1: a1 1 0 ... 0
2: a2 0 1 0
3: a3 2 0 ... 1
4: a4 0 0 0
5: a5 0 0 ... 0
Previously I did it with R code like this (using data.table):
>DT[ ,list(A_id_grouped, sum_up = sum(B == "up"),
+ sum_down = sum(B == "down"),
+ ...,
+ over_200_up = sum(up == "up" & < 200), by=list(A)];
However all of my recent attempts with Python have failed me:
DT.agg({"D": [np.sum(DT[DT["B"]=="up"]),np.sum(DT[DT["B"]=="up"])], ...
"C": np.sum(DT[(DT["B"]=="up") & (DT["C"]>200)])
})
Thank you in advance! it seems like a simple question however I couldn't find it anywhere.
To complement unutbu's answer, here's an approach using apply on the groupby object.
>>> df.groupby('A_id').apply(lambda x: pd.Series(dict(
sum_up=(x.B == 'up').sum(),
sum_down=(x.B == 'down').sum(),
over_200_up=((x.B == 'up') & (x.C > 200)).sum()
)))
over_200_up sum_down sum_up
A_id
a1 0 0 1
a2 0 1 0
a3 1 0 2
a4 0 0 0
a5 0 0 0
There might be a better way; I'm pretty new to pandas, but this works:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A_id':'a1 a2 a3 a3 a4 a5'.split(),
'B': 'up down up up left right'.split(),
'C': [100, 102, 100, 250, 100, 102]})
df['D'] = (df['B']=='up') & (df['C'] > 200)
grouped = df.groupby(['A_id'])
def sum_up(grp):
    return np.sum(grp=='up')

def sum_down(grp):
    return np.sum(grp=='down')

def over_200_up(grp):
    return np.sum(grp)
result = grouped.agg({'B': [sum_up, sum_down],
'D': [over_200_up]})
result.columns = [col[1] for col in result.columns]
print(result)
yields
sum_up sum_down over_200_up
A_id
a1 1 0 0
a2 0 1 0
a3 2 0 1
a4 0 0 0
a5 0 0 0
An old question, but I feel a better way, avoiding the apply, is to create a new dataframe before grouping and aggregating:
df = df.set_index('A_id')
outcome = {'sum_up' : df.B.eq('up'),
'sum_down': df.B.eq('down'),
'over_200_up' : df.B.eq('up') & df.C.gt(200)}
outcome = pd.DataFrame(outcome).groupby(level=0).sum()
outcome
sum_up sum_down over_200_up
A_id
a1 1 0 0
a2 0 1 0
a3 2 0 1
a4 0 0 0
a5 0 0 0
Another option would be to unstack before grouping; however, I feel it is a longer, unnecessary process:
(df
.set_index(['A_id', 'B'], append = True)
.C
.unstack('B')
.assign(gt_200 = lambda df: df.up.gt(200))
.groupby(level='A_id')
.agg(sum_up=('up', 'count'),
sum_down =('down', 'count'),
over_200_up = ('gt_200', 'sum')
)
)
sum_up sum_down over_200_up
A_id
a1 1 0 0
a2 0 1 0
a3 2 0 1
a4 0 0 0
a5 0 0 0
Here is what I have recently learned, using DataFrame.assign and numpy's where method:
df3=
A_id B C
1: a1 "up" 100
2: a2 "down" 102
3: a3 "up" 100
3: a3 "up" 250
4: a4 "left" 100
5: a5 "right" 102
df3.assign(sum_up=np.where(df3['B']=='up', 1, 0),
           sum_down=np.where(df3['B']=='down', 1, 0),
           over_200_up=np.where((df3['B']=='up') & (df3['C']>200), 1, 0)
          ).groupby('A_id', as_index=False).agg({'sum_up': sum, 'sum_down': sum, 'over_200_up': sum})
outcome=
A_id sum_up sum_down over_200_up
0 a1 1 0 0
1 a2 0 1 0
2 a3 2 0 1
3 a4 0 0 0
4 a5 0 0 0
This also resembles SQL's CASE, if you are familiar with it and want to apply the same logic in pandas:
select a,
sum(case when B='up' then 1 else 0 end) as sum_up
....
from table
group by a
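For completeness, the CASE expression maps almost directly onto np.where, which is essentially what the assign call above does; a small sketch assuming df3 as defined above:
import numpy as np

df3['sum_up'] = np.where(df3['B'] == 'up', 1, 0)           # case when B='up' then 1 else 0 end
out = df3.groupby('A_id', as_index=False)['sum_up'].sum()  # group by a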
