How to multiply combinations of two sets of pandas dataframe columns - python

I would like to multiply the combinations of two sets of columns.
Say there is a DataFrame like this:
import pandas as pd
df = {'A':[1,2,3], 'B':[4,5,6], 'C':[7,8,9], 'D':[0,1,2]}
df = pd.DataFrame(df)
Now, I want to multiply the pairs AC, AD, BC, and BD.
This is like multiplying every combination of [A, B] with [C, D].
I tried to use itertools but failed to figure it out.
So, the desired output will be like:
output = {'AC':[7,16,27], 'AD':[0,2,6], 'BC':[28,40,54], 'BD':[0,5,12]}
output = pd.DataFrame(output)

IIUC, you can try:
import itertools
cols1 = ['A', 'B']
cols2 = ['C', 'D']
for col1, col2 in itertools.product(cols1, cols2):
    df[col1+col2] = df[col1] * df[col2]
print(df)
A B C D AC AD BC BD
0 1 4 7 0 7 0 28 0
1 2 5 8 1 16 2 40 5
2 3 6 9 2 27 6 54 12
Or, to create a new DataFrame:
out = pd.concat([df[col1].mul(df[col2]).to_frame(col1+col2)
                 for col1, col2 in itertools.product(cols1, cols2)], axis=1)
print(out)
AC AD BC BD
0 7 0 28 0
1 16 2 40 5
2 27 6 54 12
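If you prefer to avoid the Python-level loop entirely, the same pair products can be computed with one NumPy broadcast per row; a sketch (the reshape order happens to match itertools.product, which supplies the column names):

```python
import itertools
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6],
                   'C': [7, 8, 9], 'D': [0, 1, 2]})
cols1, cols2 = ['A', 'B'], ['C', 'D']

# Broadcast (rows, 2, 1) * (rows, 1, 2) -> (rows, 2, 2), i.e. one
# outer product per row, then flatten the last two axes into the
# four pair columns AC, AD, BC, BD.
left = df[cols1].to_numpy()[:, :, None]
right = df[cols2].to_numpy()[:, None, :]
names = [a + b for a, b in itertools.product(cols1, cols2)]
out = pd.DataFrame((left * right).reshape(len(df), -1), columns=names)
```

This scales better than a column-by-column loop when the two groups are large, at the cost of being less obvious to read.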

You can directly multiply multiple columns if you convert them to NumPy arrays first with .to_numpy()
>>> df[["A","B"]].to_numpy() * df[["C","D"]].to_numpy()
array([[ 7, 0],
[16, 5],
[27, 12]])
You can also unzip a collection of wanted pairs and use them to get new views of your DataFrame (indexing the same column multiple times is fine), then multiply the resulting NumPy arrays together:
>>> import math  # math.prod() is in the standard library (Python 3.8+)
>>> pairs = ["AC", "AD", "BC", "BD"] # wanted pairs
>>> result = math.prod(df[[*cols]].to_numpy() for cols in zip(*pairs))
>>> pd.DataFrame(result, columns=pairs) # new dataframe
AC AD BC BD
0 7 0 28 0
1 16 2 40 5
2 27 6 54 12
This extends to any number of groups (triples, octuples of columns, ...) as long as they're the same length (beware: zip() will silently drop extra columns beyond the shortest group):
>>> pairs = ["ABD", "BCD"]
>>> result = math.prod(df[[*cols]].to_numpy() for cols in zip(*pairs))
>>> pd.DataFrame(result, columns=pairs)
ABD BCD
0 0 0
1 10 40
2 36 108

Related

Outer product on Pandas DataFrame rows

I have two DataFrames with identical column labels. The columns are label, data1, data2, ..., dataN.
I need to take the product of the DataFrames, multiplying data1 * data1, data2 * data2, etc. for every possible combination of rows in DataFrame1 with the rows in DataFrame2. As such, I want the resulting DataFrame to maintain the label column of both frames in some way.
Example:
Frame 1:
label  d1  d2  d3
a       1   2   3
b       4   5   6
Frame 2:
label  d1  d2  d3
c       7   8   9
d      10  11  12
Result:
label_1  label_2  d1  d2  d3
a        c         7  16  27
a        d        10  22  36
b        c        28  40  54
b        d        40  55  72
I feel like there is a nice way to do this, but all I can come up with is gross loops with lots of memory reallocation.
Let's do a cross merge first, then multiply the dn_x columns with the dn_y columns:
out = df1.merge(df2, how='cross')
out = (out.filter(like='label')
          .join(out.filter(regex='d.*_x')
                   .mul(out.filter(regex='d.*_y').values)
                   .rename(columns=lambda col: col.split('_')[0])))
print(out)
label_x label_y d1 d2 d3
0 a c 7 16 27
1 a d 10 22 36
2 b c 28 40 54
3 b d 40 55 72
First idea: DataFrame.reindex with a MultiIndex created by MultiIndex.from_product:
mux = pd.MultiIndex.from_product([df1['label'], df2['label']])
df = (df1.set_index('label').reindex(mux, level=0)
         .mul(df2.set_index('label').reindex(mux, level=1))
         .rename_axis(['label1','label2'])
         .reset_index())
print (df)
label1 label2 d1 d2 d3
0 a c 7 16 27
1 a d 10 22 36
2 b c 28 40 54
3 b d 40 55 72
Or solution with cross join:
df = (df1.rename(columns={'label':'label1'})
         .merge(df2.rename(columns={'label':'label2'}),
                how='cross',
                suffixes=('_','')))
For the data columns, select the columns ending with _, multiply them into the matching columns without the suffix, and finally drop the suffixed columns:
cols = df.filter(regex='_$').columns
no_ = cols.str.rstrip('_')
df[no_] *= df[cols].to_numpy()
df = df.drop(cols, axis=1)
print (df)
label1 label2 d1 d2 d3
0 a c 7 16 27
1 a d 10 22 36
2 b c 28 40 54
3 b d 40 55 72
One option is with a cross join, using expand_grid from pyjanitor, before computing the products:
# pip install pyjanitor
import pandas as pd
import janitor as jn
others = {'df1':df1, 'df2':df2}
out = jn.expand_grid(others=others)
numbers = out.select_dtypes('number')
numbers = numbers['df1'] * numbers['df2']
labels = out.select_dtypes('object')
labels.columns = ['label_1', 'label_2']
pd.concat([labels, numbers], axis = 1)
label_1 label_2 d1 d2 d3
0 a c 7 16 27
1 a d 10 22 36
2 b c 28 40 54
3 b d 40 55 72
OP here. Ynjxsjmh's answer allowed me to write the code to solve my problem, but I just wanted to post a function which is a little more general in the end and includes a little more explanation for anyone who stumbles here in the future.
Hit me with suggestions if you think of anything.
def exhaustive_df_operation(
    df1: pd.DataFrame,
    df2: pd.DataFrame,
    func: callable,
    label_cols: list,
    suffixes: tuple = ("_x", "_y"),
):
    """
    Given DataFrames with multiple rows, executes the given
    function on all row combinations, i.e. in an exhaustive manner.
    DataFrame column names must be the same. label_cols are the
    columns which label the input/output and should not be used in
    the computation.

    Arguments:
        df1: pd.DataFrame
            First DataFrame to act on.
        df2: pd.DataFrame
            Second DataFrame to act on.
        func: callable
            NumPy function to call as the operation on the DataFrames.
        label_cols: list
            The column names corresponding to columns that label the
            rows as distinct. Must be common to both DataFrames;
            several may be passed.
        suffixes: tuple
            The suffixes to use when calculating the cross merge.

    Returns:
        result: pd.DataFrame
            DataFrame that results from applying func; will have
            len(df1) * len(df2) rows. label_cols will label the
            DataFrame from which each row was sourced.

    e.g.  df1                df2
          label  a  b        label  a  b
          i      1  2        k      5  6
          j      3  4        l      7  8

          func = np.add
          label_cols = ['label']
          suffixes = ("_x", "_y")

          result =
          label_x  label_y   a   b
          i        k         6   8
          i        l         8  10
          j        k         8  10
          j        l        10  12
    """
    # Create a merged DataFrame with an exhaustive "cross" product
    merged = df1.merge(df2, how="cross", suffixes=suffixes)
    # The names of the columns that will identify result rows
    label_col_names = [col + suf for col in label_cols for suf in suffixes]
    # The actual identifying columns
    label_cols = merged[label_col_names]
    # Non-label columns ending in suffixes[0]
    data_col_names = [
        col
        for col in merged.columns
        if (col.endswith(suffixes[0]) and col not in label_col_names)
    ]
    data_1 = merged[data_col_names]
    # Needed for the rename later - strips the suffix from data columns
    name_fix_dict = {old: old[: -len(suffixes[0])] for old in data_col_names}
    # Non-label columns ending in suffixes[1]
    data_col_names = [
        col
        for col in merged.columns
        if (col.endswith(suffixes[1]) and col not in label_col_names)
    ]
    data_2 = merged[data_col_names]
    # .values is needed because data_1 and data_2 have different column
    # labels, which confuses pandas/NumPy alignment.
    result = label_cols.join(func(data_1, data_2.values))
    # Remove suffixes from data columns
    result.rename(columns=name_fix_dict, inplace=True)
    return result

Sum up multiple columns into one column [duplicate]

I have the following DataFrame:
In [1]:
df = pd.DataFrame({'a': [1, 2, 3],
'b': [2, 3, 4],
'c': ['dd', 'ee', 'ff'],
'd': [5, 9, 1]})
df
Out [1]:
a b c d
0 1 2 dd 5
1 2 3 ee 9
2 3 4 ff 1
I would like to add a column 'e' which is the sum of columns 'a', 'b' and 'd'.
Going across forums, I thought something like this would work:
df['e'] = df[['a', 'b', 'd']].map(sum)
But it didn't.
I would like to know the appropriate operation with the list of columns ['a', 'b', 'd'] and df as inputs.
You can just call sum and set param axis=1 to sum across the rows; this will ignore non-numeric columns:
In [91]:
df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4], 'c':['dd','ee','ff'], 'd':[5,9,1]})
df['e'] = df.sum(axis=1)
df
Out[91]:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
If you want to just sum specific columns then you can create a list of the columns and remove the ones you are not interested in:
In [98]:
col_list= list(df)
col_list.remove('d')
col_list
Out[98]:
['a', 'b', 'c']
In [99]:
df['e'] = df[col_list].sum(axis=1)
df
Out[99]:
a b c d e
0 1 2 dd 5 3
1 2 3 ee 9 5
2 3 4 ff 1 7
If you have just a few columns to sum, you can write:
df['e'] = df['a'] + df['b'] + df['d']
This creates new column e with the values:
a b c d e
0 1 2 dd 5 8
1 2 3 ee 9 14
2 3 4 ff 1 8
For longer lists of columns, EdChum's answer is preferred.
Create a list of the column names you want to add up.
df['total'] = df.loc[:, list_name].sum(axis=1)
If you only want the sum for certain rows, replace the ':' with a row selection.
This is a simpler way using iloc to select which columns to sum:
df['f']=df.iloc[:,0:2].sum(axis=1)
df['g']=df.iloc[:,[0,1]].sum(axis=1)
df['h']=df.iloc[:,[0,3]].sum(axis=1)
Produces:
a b c d e f g h
0 1 2 dd 5 8 3 3 6
1 2 3 ee 9 14 5 5 11
2 3 4 ff 1 8 7 7 4
I can't find a way to combine a range and specific columns that works, e.g. something like:
df['i']=df.iloc[:,[[0:2],3]].sum(axis=1)
df['i']=df.iloc[:,[0:2,3]].sum(axis=1)
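One way to combine a slice range with specific positions, assuming NumPy is available, is the np.r_ index builder, which concatenates slices and scalars into a single integer array usable with iloc; a sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 3, 4],
                   'c': ['dd', 'ee', 'ff'], 'd': [5, 9, 1]})

# np.r_[0:2, 3] -> array([0, 1, 3]): the range 0:2 plus position 3,
# selecting columns a, b and d in one go
df['i'] = df.iloc[:, np.r_[0:2, 3]].sum(axis=1)
```

This sums columns a, b and d while skipping the string column c.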
You can simply pass your dataframe into the following function:
def sum_frame_by_column(frame, new_col_name, list_of_cols_to_sum):
    frame[new_col_name] = frame[list_of_cols_to_sum].astype(float).sum(axis=1)
    return frame
Example:
I have a dataframe (awards_frame) as follows:
...and I want to create a new column that shows the sum of awards for each row:
Usage:
I simply pass my awards_frame into the function, also specifying the name of the new column, and a list of column names that are to be summed:
sum_frame_by_column(awards_frame, 'award_sum', ['award_1','award_2','award_3'])
Result:
The following syntax helped me when my columns are in sequence:
awards_frame.values[:,1:4].sum(axis =1)
You can use the function aggregate or its alias agg:
df[['a','b','d']].agg('sum', axis=1)
The advantage of agg is that you can use multiple aggregation functions:
df[['a','b','d']].agg(['sum', 'prod', 'min', 'max'], axis=1)
Output:
sum prod min max
0 8 10 1 5
1 14 54 2 9
2 8 12 1 4
The shortest and simplest way here is to use
df.eval('e = a + b + d')

How to add a list of lists to an existing DataFrame as separate columns

I have a data frame as below
df = pd.DataFrame([['aa', 1], ['bb', 2], ['cc', 3]])
0 1
0 aa 1
1 bb 2
2 cc 3
How can I add a list of lists li = [['xx',11], ['yy',22], ['zz',33]] to the DataFrame df so that each sublist in li is appended to the corresponding row as new columns? The expected output is below:
0 1 2 3
0 aa 1 xx 11
1 bb 2 yy 22
2 cc 3 zz 33
Currently I am looping through the indexes of the sublists and adding them to df:
for i in range(len(li[0])):
    df[str(df.shape[1])] = [x[i] for x in li]
Is there any simpler way to do this without looping?
len(df) always equals len(li)
all sublists in li are of the same length
Use DataFrame.join as:
df = df.join(pd.DataFrame(li, columns=df.columns+2))
print(df)
0 1 2 3
0 aa 1 xx 11
1 bb 2 yy 22
2 cc 3 zz 33
If the number of columns varies dynamically, then:
df = df.join(pd.DataFrame(li, columns=df.columns+df.shape[1]))
If the number of columns is different, then:
df = df.join(pd.DataFrame(li, columns=np.arange(len(li[0]))+df.shape[1]))
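An alternative sketch with pd.concat, which sidesteps the column-label arithmetic entirely: ignore_index=True relabels the combined columns as 0..3, avoiding the duplicate 0/1 labels the two frames would otherwise share.

```python
import pandas as pd

df = pd.DataFrame([['aa', 1], ['bb', 2], ['cc', 3]])
li = [['xx', 11], ['yy', 22], ['zz', 33]]

# Concatenate side by side; ignore_index=True renumbers the columns
# 0, 1, 2, 3 instead of keeping two sets of 0/1 labels
out = pd.concat([df, pd.DataFrame(li)], axis=1, ignore_index=True)
```

This relies on both frames sharing the same row index, which holds here since len(df) == len(li).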

How to replace the first n elements in each row of a dataframe that are larger than a certain threshold

I have a huge DataFrame that contains only numbers (the one I show below is just for demonstration purposes). My goal is to replace, in each row of the DataFrame, the first n numbers that are larger than a certain value val by 0.
To give an example:
My dataframe could look like this:
c1 c2 c3 c4
0 38 10 1 8
1 44 12 17 46
2 13 6 2 7
3 9 16 13 26
If I now choose n = 2 (number of replacements) and val = 10, my desired output would look like this:
c1 c2 c3 c4
0 0 10 1 8
1 0 0 17 46
2 0 6 2 7
3 9 0 0 26
In the first row, only one value is larger than val so only one gets replaced, in the second row all values are larger than val but only the first two can be replaced. Analog for rows 3 and 4 (please note that not only the first two columns are affected but the first two values in a row which can be in any column).
A straightforward and very ugly implementation could look like this:
import numpy as np
import pandas as pd
np.random.seed(1)
col1 = [np.random.randint(1, 50) for ti in range(4)]
col2 = [np.random.randint(1, 50) for ti in range(4)]
col3 = [np.random.randint(1, 50) for ti in range(4)]
col4 = [np.random.randint(1, 50) for ti in range(4)]
df = pd.DataFrame({'c1': col1, 'c2': col2, 'c3': col3, 'c4': col4})
val = 10
n = 2
for ind, row in df.iterrows():
    # number of replacements
    re = 0
    for indi, vali in enumerate(row):
        if vali > val:
            df.iloc[ind, indi] = 0
            re += 1
            if re == n:
                break
That works but I am sure that there are much more efficient ways of doing this. Any ideas?
You could write your own slightly unusual function and use apply with axis=1:
def f(x, n, m):
    y = x.copy()
    y[y[y > m].iloc[:n].index] = 0
    return y
In [380]: df
Out[380]:
c1 c2 c3 c4
0 38 10 1 8
1 44 12 17 46
2 13 6 2 7
3 9 16 13 26
In [381]: df.apply(f, axis=1, n=2, m=10)
Out[381]:
c1 c2 c3 c4
0 0 10 1 8
1 0 0 17 46
2 0 6 2 7
3 9 0 0 26
Note: y = x.copy() makes a copy of the Series. If you want to change the values in place, you can omit that line. The extra y is needed because slicing gives you a copy, not the original object.
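A fully vectorized alternative (a sketch; it assumes the left-to-right column order is the scan order, as in the question) counts over-threshold cells with a cumulative sum and masks only the first n per row, avoiding apply entirely:

```python
import pandas as pd

df = pd.DataFrame({'c1': [38, 44, 13, 9], 'c2': [10, 12, 6, 16],
                   'c3': [1, 17, 2, 13], 'c4': [8, 46, 7, 26]})
n, val = 2, 10

# Per row, cumsum counts how many over-threshold cells have appeared
# so far (left to right); zero a cell only if it is over val AND it
# is among the first n such cells in its row.
over = df > val
out = df.mask(over & (over.cumsum(axis=1) <= n), 0)
```

For large frames this should be considerably faster than a row-wise apply, since all the work happens in vectorized pandas/NumPy operations.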

