I have two DataFrames with identical column labels. The columns are label, data1, data2, ..., dataN.
I need to take the product of the DataFrames, multiplying data1 * data1, data2 * data2, etc. for every possible combination of rows in DataFrame1 with the rows in DataFrame2. As such, I want the resulting DataFrame to keep the label column of both frames in some way.
Example:
Frame 1:

label  d1  d2  d3
    a   1   2   3
    b   4   5   6

Frame 2:

label  d1  d2  d3
    c   7   8   9
    d  10  11  12

Result:

label_1  label_2  d1  d2  d3
      a        c   7  16  27
      a        d  10  22  36
      b        c  28  40  54
      b        d  40  55  72
I feel like there is a nice way to do this, but all I can come up with is gross loops with lots of memory reallocation.
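For reference, here is a minimal setup that reproduces the example frames above (the names df1 and df2 are the ones the answers below assume):

import pandas as pd

df1 = pd.DataFrame({'label': ['a', 'b'], 'd1': [1, 4], 'd2': [2, 5], 'd3': [3, 6]})
df2 = pd.DataFrame({'label': ['c', 'd'], 'd1': [7, 10], 'd2': [8, 11], 'd3': [9, 12]})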
Let's do a cross merge first, then multiply the dn_x columns by the dn_y columns:

out = df1.merge(df2, how='cross')
out = (out.filter(like='label')
          .join(out.filter(regex='d.*_x')
                   .mul(out.filter(regex='d.*_y').values)
                   .rename(columns=lambda col: col.split('_')[0])))
print(out)

  label_x label_y  d1  d2  d3
0       a       c   7  16  27
1       a       d  10  22  36
2       b       c  28  40  54
3       b       d  40  55  72
First idea: DataFrame.reindex with a MultiIndex created by MultiIndex.from_product:

mux = pd.MultiIndex.from_product([df1['label'], df2['label']])
df = (df1.set_index('label').reindex(mux, level=0)
         .mul(df2.set_index('label').reindex(mux, level=1))
         .rename_axis(['label1','label2'])
         .reset_index())
print (df)

  label1 label2  d1  d2  d3
0      a      c   7  16  27
1      a      d  10  22  36
2      b      c  28  40  54
3      b      d  40  55  72
Or a solution with a cross join:

df = (df1.rename(columns={'label':'label1'})
         .merge(df2.rename(columns={'label':'label2'}),
                how='cross',
                suffixes=('_','')))

For multiple data columns: get the columns ending with _, multiply the matching columns without _ by them, and finally drop the _ columns:

cols = df.filter(regex='_$').columns
no_ = cols.str.rstrip('_')
df[no_] *= df[cols].to_numpy()
df = df.drop(cols, axis=1)
print (df)

  label1 label2  d1  d2  d3
0      a      c   7  16  27
1      a      d  10  22  36
2      b      c  28  40  54
3      b      d  40  55  72
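Not from the original answers, but since the question worries about loops and memory reallocation: the whole cross product can also be sketched with a single NumPy broadcast (assuming df1/df2 as defined in the question):

import numpy as np

a = df1.set_index('label')
b = df2.set_index('label')
# broadcast every row of a against every row of b, then flatten to len(a)*len(b) rows
vals = (a.to_numpy()[:, None, :] * b.to_numpy()[None, :, :]).reshape(-1, a.shape[1])
idx = pd.MultiIndex.from_product([a.index, b.index], names=['label1', 'label2'])
res = pd.DataFrame(vals, index=idx, columns=a.columns).reset_index()

The row order of the flattened broadcast matches MultiIndex.from_product, so the labels line up with the computed values.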
One option is with a cross join, using expand_grid from pyjanitor, before computing the products:
# pip install pyjanitor
import pandas as pd
import janitor as jn
others = {'df1': df1, 'df2': df2}
out = jn.expand_grid(others=others)  # columns come back as a MultiIndex keyed by the dict keys
numbers = out.select_dtypes('number')
numbers = numbers['df1'] * numbers['df2']
labels = out.select_dtypes('object')
labels.columns = ['label_1', 'label_2']
pd.concat([labels, numbers], axis=1)

  label_1 label_2  d1  d2  d3
0       a       c   7  16  27
1       a       d  10  22  36
2       b       c  28  40  54
3       b       d  40  55  72
OP here. Ynjxsjmh's answer allowed me to write the code to solve my problem, but I just wanted to post a function which is a little more general in the end and includes a little more explanation for anyone who stumbles here in the future.
Hit me with suggestions if you think of anything.
def exhaustive_df_operation(
    df1: pd.DataFrame,
    df2: pd.DataFrame,
    func: callable,
    label_cols: list,
    suffixes: tuple = ("_x", "_y"),
):
    """
    Given DataFrames with multiple rows, executes the given
    function on all row combinations, i.e. in an exhaustive manner.
    DataFrame column names must be the same. Label cols are the
    columns which label the input/output and should not be used in
    the computation.

    Arguments:
        df1: pd.DataFrame
            First DataFrame to act on.
        df2: pd.DataFrame
            Second DataFrame to act on.
        func: callable
            numpy function to call as the operation on the DataFrames.
        label_cols: list
            The column names corresponding to columns that label the
            rows as distinct. Must be common to the DataFrames, but
            several may be passed.
        suffixes: tuple
            The suffixes to use when calculating the cross merge.

    Returns:
        result: pd.DataFrame
            DataFrame that results from func, with
            len(df1)*len(df2) rows. label_cols will label the
            DataFrame from which each row was sourced.

    eg. df1            df2
        label  a  b    label  a  b
        i      1  2    k      5  6
        j      3  4    l      7  8

        func = np.add
        label_cols = ['label']
        suffixes = ("_x", "_y")

        result =
        label_x  label_y  a   b
        i        k        6   8
        i        l        8   10
        j        k        8   10
        j        l        10  12
    """
    # Creating a merged DataFrame with an exhaustive "cross" product
    merged = df1.merge(df2, how="cross", suffixes=suffixes)
    # The names of the columns that will identify result rows
    label_col_names = [col + suf for col in label_cols for suf in suffixes]
    # The actual identifying columns
    labels = merged[label_col_names]
    # Non-label columns ending with suffixes[0]
    data_col_names = [
        col
        for col in merged.columns
        if (col.endswith(suffixes[0]) and col not in label_col_names)
    ]
    data_1 = merged[data_col_names]
    # Will need for rename later - removes the suffix from data
    # column names
    name_fix_dict = {old: old[: -len(suffixes[0])] for old in data_col_names}
    # Non-label columns ending with suffixes[1]
    data_col_names = [
        col
        for col in merged.columns
        if (col.endswith(suffixes[1]) and col not in label_col_names)
    ]
    data_2 = merged[data_col_names]
    # Need .values because data_1 and data_2 have different column
    # labels, which confuses pandas/numpy.
    result = labels.join(func(data_1, data_2.values))
    # Removing suffixes from data columns
    result.rename(columns=name_fix_dict, inplace=True)
    return result
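A quick usage sketch with the question's frames (assuming df1, df2 and numpy imported as np are in scope; np.multiply reproduces the product asked for above):

result = exhaustive_df_operation(df1, df2, np.multiply, label_cols=['label'])
print(result)

should give:

  label_x label_y  d1  d2  d3
0       a       c   7  16  27
1       a       d  10  22  36
2       b       c  28  40  54
3       b       d  40  55  72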
Related
I would like to multiply the combinations of two sets of columns
Let's say there is a DataFrame like the one below:
import pandas as pd
df = {'A':[1,2,3], 'B':[4,5,6], 'C':[7,8,9], 'D':[0,1,2]}
df = pd.DataFrame(df)
Now, I want to compute the products AC, AD, BC, BD.
This is like multiplying the combinations of [A,B] and [C,D].
I tried to use itertools but failed to figure it out.
So, the desired output will be like:
output = {'AC':[7,16,27], 'AD':[0,2,6], 'BC':[28,40,54], 'BD':[0,5,12]}
output = pd.DataFrame(output)
IIUC, you can try
import itertools
cols1 = ['A', 'B']
cols2 = ['C', 'D']
for col1, col2 in itertools.product(cols1, cols2):
    df[col1+col2] = df[col1] * df[col2]
print(df)

   A  B  C  D  AC  AD  BC  BD
0  1  4  7  0   7   0  28   0
1  2  5  8  1  16   2  40   5
2  3  6  9  2  27   6  54  12
Or create a new DataFrame:

out = pd.concat([df[col1].mul(df[col2]).to_frame(col1+col2)
                 for col1, col2 in itertools.product(cols1, cols2)], axis=1)
print(out)

   AC  AD  BC  BD
0   7   0  28   0
1  16   2  40   5
2  27   6  54  12
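The same idea as a compact variant (not from the original answer, just an equivalent dict comprehension):

out = pd.DataFrame({c1 + c2: df[c1] * df[c2]
                    for c1, c2 in itertools.product(cols1, cols2)})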
You can directly multiply multiple columns if you convert them to NumPy arrays first with .to_numpy()
>>> df[["A","B"]].to_numpy() * df[["C","D"]].to_numpy()
array([[ 7,  0],
       [16,  5],
       [27, 12]])
You can also unzip a collection of wanted pairs and use them to get new views of your DataFrame (indexing the same column multiple times is fine), then multiply the two new NumPy arrays together!
>>> import math # standard library for prod()
>>> pairs = ["AC", "AD", "BC", "BD"] # wanted pairs
>>> result = math.prod(df[[*cols]].to_numpy() for cols in zip(*pairs))
>>> pd.DataFrame(result, columns=pairs) # new dataframe
   AC  AD  BC  BD
0   7   0  28   0
1  16   2  40   5
2  27   6  54  12
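To see why this works: zip(*pairs) transposes the pair strings into one group of first letters and one group of second letters (a quick illustration, not part of the original answer):

>>> list(zip(*["AC", "AD", "BC", "BD"]))
[('A', 'A', 'B', 'B'), ('C', 'D', 'C', 'D')]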
This extends to any number of pairs (triples, octuples of columns..) as long as they're the same length (beware: zip() will silently drop extra columns beyond the shortest group)
>>> pairs = ["ABD", "BCD"]
>>> result = math.prod(df[[*cols]].to_numpy() for cols in zip(*pairs))
>>> pd.DataFrame(result, columns=pairs)
   ABD  BCD
0    0    0
1   10   40
2   36  108
I have two dataframes with hundreds of columns.
Some have the same name, some do not.
I want the two dataframes to have the columns with same name listed in the same order.
Typically, if those were the only columns, I would do:
df2 = df2.filter(df1.columns)
However, because there are columns with different names, this would eliminate all columns in df2 that do not exist in df1.
How do I order all common columns in the same order without losing the columns that are not in common? Those not in common must be kept in their original order. Because I have hundreds of columns I cannot do it manually and need a quick solution like "filter". Please note that though there are similar questions, they do not deal with the case where some columns are in common and some are different.
Example:
df1.columns = A,B,C,...,Z,1,2,...,1000
df2.columns = Z,K,P,T,...,01,02,...,01000
I want to reorder the columns for df2 to be:
df2.columns = A,B,C,...,Z,01,02,...,01000
Try set operations on column names, like intersection and difference:
Set up an MRE:
>>> df1
   A  B  C  D
0  2  7  7  5
1  6  8  4  2
>>> df2
   C  B  E  F
0  8  7  3  2
1  8  6  5  8
c0 = df1.columns.intersection(df2.columns)
c1 = df1.columns.difference(df2.columns)
c2 = df2.columns.difference(df1.columns)
df1 = df1[c0.tolist() + c1.tolist()]
df2 = df2[c0.tolist() + c2.tolist()]
Output:
>>> df1
   B  C  A  D
0  7  7  2  5
1  8  4  6  2
>>> df2
   B  C  E  F
0  7  8  3  2
1  6  8  5  8
Assume you want to also keep columns that are not in common in the same place:
# make a copy of df2 column names
new_cols = df2.columns.values.copy()
# reorder common column names in df2 to be same order as df1
new_cols[df2.columns.isin(df1.columns)] = df1.columns[df1.columns.isin(df2.columns)]
# reorder columns using new_cols
df2[new_cols]
Example:
df1 = pd.DataFrame([[1,2,3,4,5]], columns=list('badfe'))
df2 = pd.DataFrame([[1,2,3,4,5]], columns=list('fsxad'))
df1
   b  a  d  f  e
0  1  2  3  4  5
df2
   f  s  x  a  d
0  1  2  3  4  5
new_cols = df2.columns.values.copy()
new_cols[df2.columns.isin(df1.columns)] = df1.columns[df1.columns.isin(df2.columns)]
df2[new_cols]
   a  s  x  d  f
0  4  2  3  5  1
You can do this using pd.Index.intersection, pd.Index.difference and pd.Index.union:

i = df1.columns.intersection(df2.columns, sort=False).union(
    df2.columns.difference(df1.columns), sort=False
)
out = df2.loc[:, i]
df1 = pd.DataFrame(columns=list("ABCEFG"))
df2 = pd.DataFrame(columns=list("ECDAFGHI"))
print(df1)
print(df2)
i = df2.columns.intersection(df1.columns, sort=False).union(
    df2.columns.difference(df1.columns), sort=False
)
print(df2.loc[:,i])
Empty DataFrame
Columns: [A, B, C, E, F, G]
Index: []
Empty DataFrame
Columns: [E, C, D, A, F, G, H, I]
Index: []
Empty DataFrame
Columns: [A, C, E, F, G, D, H, I]
Index: []
I have a data frame like this:
>df = pd.DataFrame({'A':['M',2,3],'B':['M',2,3],'AA':['N',20,30],'BB':['N',20,30]})
>df = df.rename(columns={df.columns[2]: 'A'})
>df = df.rename(columns={df.columns[3]: 'B'})
>df
   A  B   A   B
0  M  M   N   N
1  2  2  20  20
2  3  3  30  30
and I have to split the DataFrame vertically by whether row index 0 is 'M' or 'N':

   A  B
0  M  M
1  2  2
2  3  3

    A   B
0   N   N
1  20  20
2  30  30
The data in the data frame comes from an Excel sheet and the column names are not unique.
Thanks for help!
This should get the job done:
df.loc[:,df.iloc[0, :] == "M"]
df.loc[:,df.iloc[0, :] == "N"]
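If the marker values in row 0 aren't known in advance, the same idea generalizes; a sketch (not from the original answer) that collects one sub-frame per unique value in the first row:

parts = {v: df.loc[:, df.iloc[0] == v] for v in df.iloc[0].unique()}
# parts['M'] and parts['N'] hold the two sub-frames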
Use pandas iloc for selecting columns:
import pandas as pd
df = pd.DataFrame({'A':['M',2,3],'B':['M',2,3],'AA':['N',20,30],'BB':['N',20,30]})
df = df.rename(columns={df.columns[2]: 'A'})
df = df.rename(columns={df.columns[3]: 'B'})
df1 = df.iloc[:, :2]
df2 = df.iloc[:, 2:]
Output:
   A  B
0  M  M
1  2  2
2  3  3

    A   B
0   N   N
1  20  20
2  30  30
Use a list comprehension with loc:

dfs = [df.loc[:, df.loc[0,:].eq(s)] for s in ['M','N']]

This gives the separate DataFrames in a list.
How am I supposed to remove the index column in the first row? I know it is not counted as a column, but when I transpose the DataFrame, it does not allow me to use my headers anymore.
In [297]: df = df.transpose()
print(df)
df = df.drop('RTM',1)
df = df.drop('Requirements', 1)
df = df.drop('Test Summary Report', 1)
print(df)
This throws me an error "labels ['RTM'] not contained in axis".
RTM is contained in an axis and this works if I do index_col=0
df = xl.parse(sheet_name, header=1, index_col=0, usecols="A:E", nrows=6)
but then I lose my (0,0) value "Artifact name" as a header. Any help will be appreciated.
You can do this with .iloc, assigning the column names from the first row after transposing. Then you can drop the now-redundant first row and clean up the name:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': list('ABCDE'),
                   'val1': np.arange(1,6,1),
                   'val2': np.arange(11,16,1)})

  id  val1  val2
0  A     1    11
1  B     2    12
2  C     3    13
3  D     4    14
4  E     5    15
Transpose and clean up the names
df = df.T
df.columns = df.iloc[0]
df = df.drop(df.index[0])   # drop the 'id' row that now duplicates the header
df.columns.name = None
df is now:
      A   B   C   D   E
val1   1   2   3   4   5
val2  11  12  13  14  15
Alternatively, just create a new DataFrame to begin with, specifying which column you want to be the header column.
header_col = 'id'
cols = [x for x in df.columns if x != header_col]
pd.DataFrame(df[cols].values.T, columns=df[header_col], index=cols)
Output:
id     A   B   C   D   E
val1   1   2   3   4   5
val2  11  12  13  14  15
Using the setup from @ALollz:
df.set_index('id').rename_axis(None).T
      A   B   C   D   E
val1   1   2   3   4   5
val2  11  12  13  14  15
Greetings,
I am trying to get the smallest-sized DataFrame that has valid rows.
import pandas as pd
import random

columns = ['x0','y0']
df_ = pd.DataFrame(index=range(0,30), columns=columns)
df_ = df_.fillna(0)

columns1 = ['x1','y1']
df = pd.DataFrame(index=range(0,11), columns=columns1)
for index, row in df.iterrows():
    df.loc[index, "x1"] = random.randint(1, 100)
    df.loc[index, "y1"] = random.randint(1, 100)
df_ = df_.combine_first(df)

df = pd.DataFrame(index=range(0,17), columns=columns1)
for index, row in df.iterrows():
    df.loc[index, "x2"] = random.randint(1, 100)
    df.loc[index, "y2"] = random.randint(1, 100)
df_ = df_.combine_first(df)
From the example, the DataFrame should output rows 0 to 10 and the rest should get filtered out.
I thought of keeping a counter to track the minimum row count, or using pandasql, or perhaps there is a trick to get this info (the size of the DataFrame) from the DataFrame itself.
Actually I will be appending 500+ files of various sizes and using the result to do some analysis, so performance is a consideration.
-student of python
If you want to drop the rows which have NaNs use dropna (here, that keeps the first eleven rows, 0 through 10):
In [11]: df_.dropna()
Out[11]:
    x0  x1  x2  y0  y1  y2
0    0  49  58   0  68   2
1    0   2  37   0  19  71
2    0  26  95   0  12  17
3    0  87   5   0  70  69
4    0  84  77   0  70  92
5    0  71  98   0  22   5
6    0  28  95   0  70  15
7    0  31  19   0  24  31
8    0   9  37   0  55  29
9    0  30  53   0  15  45
10   0   8  61   0  74  41
However a cleaner, more efficient, and faster way to do this entire process is to update just those first rows (I'm assuming the random integer stuff is just you generating some example dataframes).
Let's store your DataFrames in a list:
In [21]: df1 = pd.DataFrame([[1, 2], [np.nan, 4]], columns=['a', 'b'])
In [22]: df2 = pd.DataFrame([[1, 2], [5, 6], [7, 8]], columns=['a', 'c'])
In [23]: dfs = [df1, df2]
Take the minimum length:
In [24]: m = min(len(df) for df in dfs)
First create an empty DataFrame with the desired rows and columns (note: on Python 3 you need from functools import reduce first):

In [25]: columns = reduce(lambda x, y: y.columns.union(x), dfs, [])
In [26]: res = pd.DataFrame(index=np.arange(m), columns=columns)
To do this efficiently we're going to update, making these changes in place, on just this DataFrame*:

In [27]: for df in dfs:
    ...:     res.update(df)

In [28]: res
Out[28]:
   a  b  c
0  1  2  2
1  5  4  6
*If we didn't do this, or were using combine_first or similar, we'd most likely have lots of copying (new DataFrames being created), which will slow things down.
Note: combine_first doesn't offer an inplace flag... you could use combine but this is also more complicated (as well as less efficient). It's also quite straightforward to use where (and manually update), which IIRC is what combine does under the hood.
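A rough sketch of that where-based idea (my reading of the note above, not code from the answer): keep values already set in res and fill the gaps from each df after aligning it:

for df in dfs:
    aligned = df.reindex(index=res.index, columns=res.columns)
    res = res.where(res.notna(), aligned)

This reproduces combine_first-style behavior by hand, though it still creates intermediate frames, so update remains the more efficient option when modifying in place is acceptable.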