I have a CSV dataset:
import pandas as pd
df = pd.DataFrame({'A':['a','a','a','a1','a1','a1','a1','a1','a1'], 'B':['b','b','b','b1','b1','b1','b1','b1','b1'], 'C':['c','c','c','c1','c1','c1','c1','c1','c1'], 'D':['d','d1','d2','d3','d4','d5','d6','d7','d8'], 'Rank':[1,2,3,1,2,3,4,5,6]})
I want to transform it into the table shown further below. Here is what I tried:
pd.pivot_table(df, values = ['D'], index=['A','B','C'], columns = 'Rank').reset_index()
but I didn't get what I want. The desired output is:
pd.DataFrame({'A':['a','a1'], 'B':['b','b1'], 'C':['c','c1'], '1':['d','d3'], '2':['d1','d4'], '3':['d2','d5'], '4':['NaN','d6'], '5':['NaN','d7'], '6':['NaN','d8']})
You have to use pivot, not pivot_table in this case:
df.pivot(index=['A', 'B', 'C'], columns='Rank', values='D').reset_index()
Output:
Rank A B C 1 2 3 4 5 6
0 a b c d d1 d2 NaN NaN NaN
1 a1 b1 c1 d3 d4 d5 d6 d7 d8
pivot_table aggregates duplicates, but pivot doesn't, which is what you want here.
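To see the difference, here is a minimal sketch (with a made-up frame containing a duplicate (A, Rank) pair) of how the two behave on duplicates:

import pandas as pd

dup = pd.DataFrame({'A': ['a', 'a'], 'Rank': [1, 1], 'D': ['d', 'd9']})

# pivot refuses to guess which value to keep:
# dup.pivot(index='A', columns='Rank', values='D')  # raises ValueError: Index contains duplicate entries

# pivot_table aggregates instead; its default aggfunc is 'mean',
# so for strings you must pick one explicitly, e.g. 'first':
dup.pivot_table(index='A', columns='Rank', values='D', aggfunc='first')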
To remove the axis name:
df.pivot(index=['A', 'B', 'C'], columns='Rank', values='D').reset_index().rename_axis(columns=None)
Output:
A B C 1 2 3 4 5 6
0 a b c d d1 d2 NaN NaN NaN
1 a1 b1 c1 d3 d4 d5 d6 d7 d8
I am reading multiple JSON objects into one DataFrame. The problem is that some of the columns are lists. Also, the data is very big, and because of that I cannot use the available solutions on the internet; they are very slow and memory-inefficient.
Here is what my data looks like:
df = pd.DataFrame({'A': ['x1','x2','x3', 'x4'], 'B':[['v1','v2'],['v3','v4'],['v5','v6'],['v7','v8']], 'C':[['c1','c2'],['c3','c4'],['c5','c6'],['c7','c8']],'D':[['d1','d2'],['d3','d4'],['d5','d6'],['d7','d8']], 'E':[['e1','e2'],['e3','e4'],['e5','e6'],['e7','e8']]})
A B C D E
0 x1 [v1, v2] [c1, c2] [d1, d2] [e1, e2]
1 x2 [v3, v4] [c3, c4] [d3, d4] [e3, e4]
2 x3 [v5, v6] [c5, c6] [d5, d6] [e5, e6]
3 x4 [v7, v8] [c7, c8] [d7, d8] [e7, e8]
And this is the shape of my data: (441079, 12)
My desired output is:
A B C D E
0 x1 v1 c1 d1 e1
0 x1 v2 c2 d2 e2
1 x2 v3 c3 d3 e3
1 x2 v4 c4 d4 e4
.....
EDIT: After being marked as a duplicate, I would like to stress the fact that in this question I was looking for an efficient method of exploding multiple columns. The accepted answer here can explode an arbitrary number of columns on very large datasets efficiently, something the answers to the other question failed to do (and that was the reason I asked this question after testing those solutions).
pandas >= 0.25
Assuming the lists in each row have the same length across columns, you can call Series.explode on each column.
df.set_index(['A']).apply(pd.Series.explode).reset_index()
A B C D E
0 x1 v1 c1 d1 e1
1 x1 v2 c2 d2 e2
2 x2 v3 c3 d3 e3
3 x2 v4 c4 d4 e4
4 x3 v5 c5 d5 e5
5 x3 v6 c6 d6 e6
6 x4 v7 c7 d7 e7
7 x4 v8 c8 d8 e8
The idea is to set as the index all columns that must NOT be exploded first, then reset the index after.
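For example, if the frame also had a second scalar column (a hypothetical id added here for illustration), you would list both columns in set_index so that neither gets exploded:

df.assign(id=range(len(df))).set_index(['A', 'id']).apply(pd.Series.explode).reset_index()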
It's also faster.
%timeit df.set_index(['A']).apply(pd.Series.explode).reset_index()
2.22 ms ± 98.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit
(df.set_index('A')
   .apply(lambda x: x.apply(pd.Series).stack())
   .reset_index()
   .drop('level_1', axis=1))
9.14 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Use set_index on A, then apply and stack the values of the remaining columns, all condensed into a one-liner.
In [1253]: (df.set_index('A')
              .apply(lambda x: x.apply(pd.Series).stack())
              .reset_index()
              .drop('level_1', axis=1))
Out[1253]:
A B C D E
0 x1 v1 c1 d1 e1
1 x1 v2 c2 d2 e2
2 x2 v3 c3 d3 e3
3 x2 v4 c4 d4 e4
4 x3 v5 c5 d5 e5
5 x3 v6 c6 d6 e6
6 x4 v7 c7 d7 e7
7 x4 v8 c8 d8 e8
Here is a generic, vectorized function that handles any number of list columns, including rows with empty lists:

import numpy as np
import pandas as pd

def explode(df, lst_cols, fill_value=''):
    # make sure `lst_cols` is a list
    if lst_cols and not isinstance(lst_cols, list):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)
    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()
    if (lens > 0).all():
        # ALL lists in cells aren't empty
        return pd.DataFrame({
            col: np.repeat(df[col].values, lens)
            for col in idx_cols
        }).assign(**{col: np.concatenate(df[col].values) for col in lst_cols}) \
          .loc[:, df.columns]
    else:
        # at least one list in cells is empty;
        # NOTE: DataFrame.append was removed in pandas 2.0 -- use pd.concat there
        return pd.DataFrame({
            col: np.repeat(df[col].values, lens)
            for col in idx_cols
        }).assign(**{col: np.concatenate(df[col].values) for col in lst_cols}) \
          .append(df.loc[lens == 0, idx_cols]).fillna(fill_value) \
          .loc[:, df.columns]
Usage:
In [82]: explode(df, lst_cols=list('BCDE'))
Out[82]:
A B C D E
0 x1 v1 c1 d1 e1
1 x1 v2 c2 d2 e2
2 x2 v3 c3 d3 e3
3 x2 v4 c4 d4 e4
4 x3 v5 c5 d5 e5
5 x3 v6 c6 d6 e6
6 x4 v7 c7 d7 e7
7 x4 v8 c8 d8 e8
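And a quick sketch of the empty-list branch with made-up data (this path relies on DataFrame.append, so it needs pandas < 2.0): the row whose lists are empty is kept, with B and C filled by the empty-string fill_value:

In [83]: dfe = pd.DataFrame({'A': ['x1', 'x2'], 'B': [['v1', 'v2'], []], 'C': [['c1', 'c2'], []]})

In [84]: explode(dfe, lst_cols=['B', 'C'])
Out[84]:
    A   B   C
0  x1  v1  c1
1  x1  v2  c2
1  x2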
Building on @cs95's answer, we can use an if clause in the lambda function instead of setting all the other columns as the index. This has the following advantages:
Preserves column order
Lets you easily specify which columns to explode (x.name in [...]) or which to leave alone (x.name not in [...]).
df.apply(lambda x: x.explode() if x.name in ['B', 'C', 'D', 'E'] else x)
A B C D E
0 x1 v1 c1 d1 e1
0 x1 v2 c2 d2 e2
1 x2 v3 c3 d3 e3
1 x2 v4 c4 d4 e4
2 x3 v5 c5 d5 e5
2 x3 v6 c6 d6 e6
3 x4 v7 c7 d7 e7
3 x4 v8 c8 d8 e8
As of pandas 1.3.0 (What’s new in 1.3.0 (July 2, 2021)):
DataFrame.explode() now supports exploding multiple columns. Its column argument now also accepts a list of str or tuples for exploding on multiple columns at the same time (GH39240)
So now this operation is as simple as:
df.explode(['B', 'C', 'D', 'E'])
A B C D E
0 x1 v1 c1 d1 e1
0 x1 v2 c2 d2 e2
1 x2 v3 c3 d3 e3
1 x2 v4 c4 d4 e4
2 x3 v5 c5 d5 e5
2 x3 v6 c6 d6 e6
3 x4 v7 c7 d7 e7
3 x4 v8 c8 d8 e8
Or, if you want a unique index:
df.explode(['B', 'C', 'D', 'E'], ignore_index=True)
A B C D E
0 x1 v1 c1 d1 e1
1 x1 v2 c2 d2 e2
2 x2 v3 c3 d3 e3
3 x2 v4 c4 d4 e4
4 x3 v5 c5 d5 e5
5 x3 v6 c6 d6 e6
6 x4 v7 c7 d7 e7
7 x4 v8 c8 d8 e8
Gathering all of the responses on this and other threads, here is how I do it for comma-delimited values:
from __future__ import annotations  # allows `str | Sequence[str]` on Python < 3.10

from collections.abc import Sequence

import numpy as np
import pandas as pd

def explode_by_delimiter(
    df: pd.DataFrame,
    columns: str | Sequence[str],
    delimiter: str = ",",
    reindex: bool = True
) -> pd.DataFrame:
    """Convert a dataframe whose columns hold delimiter-separated strings
    into an ordinary dataframe. Requires pandas 1.3.0+."""
    if isinstance(columns, str):
        columns = [columns]
    col_dict = {
        col: df[col]
        .str.split(delimiter)
        # Without .fillna(), .explode() will fail on empty values
        .fillna({i: [np.nan] for i in df.index})
        for col in columns
    }
    df = df.assign(**col_dict).explode(columns)
    return df.reset_index(drop=True) if reindex else df
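A quick usage sketch with made-up column names:

df = pd.DataFrame({'name': ['r1', 'r2'], 'tags': ['a,b,c', None]})
explode_by_delimiter(df, 'tags')

  name tags
0   r1    a
1   r1    b
2   r1    c
3   r2  NaN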
Here is my solution using the apply function. Its main features/differences:
offers the option to explode selected columns or all columns
offers options for the values to fill in the 'missing' positions (through the parameter fill_mode = 'external', 'internal', or 'trim'; the full explanation would be long, so see the examples below and try changing the option to check the result)
Note: the option 'trim' was developed for my own needs and is out of scope for this question
def lenx(x):
    # length of list-likes, 1 for scalars
    return len(x) if isinstance(x, (list, tuple, np.ndarray, pd.Series)) else 1

def cell_size_equalize2(row, cols='', fill_mode='internal', fill_value=''):
    # positions of the columns to equalize; default to all columns
    jcols = [j for j, v in enumerate(row.index) if v in cols]
    if len(jcols) < 1:
        jcols = range(len(row.index))
    Ls = [lenx(x) for x in row.values]
    if not Ls[:-1] == Ls[1:]:  # cell sizes differ within this row
        vals = [v if isinstance(v, list) else [v] for v in row.values]
        if fill_mode == 'external':
            # pad shorter cells with `fill_value`; unselected list cells
            # keep their whole list in the first slot
            vals = [[e] + [fill_value] * (max(Ls) - 1) if (j not in jcols) and isinstance(row.values[j], list)
                    else e + [fill_value] * (max(Ls) - lenx(e))
                    for j, e in enumerate(vals)]
        elif fill_mode == 'internal':
            # pad shorter cells by repeating their last value
            vals = [[e] + [e] * (max(Ls) - 1) if (j not in jcols) and isinstance(row.values[j], list)
                    else e + [e[-1]] * (max(Ls) - lenx(e))
                    for j, e in enumerate(vals)]
        else:
            # 'trim': cut every cell down to the shortest length
            vals = [e[0:min(Ls)] for e in vals]
        row = pd.Series(vals, index=row.index.tolist())
    return row
Examples:
df=pd.DataFrame({
'a':[[1],2,3],
'b':[[4,5,7],[5,4],4],
'c':[[4,5],5,[6]]
})
print(df)
df1 = df.apply(cell_size_equalize2, cols='', fill_mode='external', fill_value = "OK", axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'external\', all columns, fill_value = \'OK\'\n', df1)
df2 = df.apply(cell_size_equalize2, cols=['a', 'b'], fill_mode='external', fill_value = "OK", axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'external\', cols = [\'a\', \'b\'], fill_value = \'OK\'\n', df2)
df3 = df.apply(cell_size_equalize2, cols=['a', 'b'], fill_mode='internal', axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'internal\', cols = [\'a\', \'b\']\n', df3)
df4 = df.apply(cell_size_equalize2, cols='', fill_mode='trim', axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'trim\', all columns\n', df4)
Output:
a b c
0 [1] [4, 5, 7] [4, 5]
1 2 [5, 4] 5
2 3 4 [6]
fill_mode='external', all columns, fill_value = 'OK'
a b c
0 1 4 4
0 OK 5 5
0 OK 7 OK
1 2 5 5
1 OK 4 OK
2 3 4 6
fill_mode='external', cols = ['a', 'b'], fill_value = 'OK'
a b c
0 1 4 [4, 5]
0 OK 5 OK
0 OK 7 OK
1 2 5 5
1 OK 4 OK
2 3 4 6
fill_mode='internal', cols = ['a', 'b']
a b c
0 1 4 [4, 5]
0 1 5 [4, 5]
0 1 7 [4, 5]
1 2 5 5
1 2 4 5
2 3 4 6
fill_mode='trim', all columns
a b c
0 1 4 4
1 2 5 5
2 3 4 6
I have two DataFrames:
df1:
A B C
1 A1 B1 C1
2 A2 B2 C2
df2:
B C D
3 B3 C3 D3
4 B4 C4 D4
Columns B and C are identical for both.
I'd like to concatenate them vertically and keep the columns of the first DataFrame:
pd.concat([df1, df2], join_axes=[df1.columns])
Output:
A B C
1 A1 B1 C1
2 A2 B2 C2
3 NaN B3 C3
4 NaN B4 C4
This works, but raises a
FutureWarning: The join_axes-keyword is deprecated. Use .reindex or .reindex_like on the result to achieve the same functionality.
I couldn't find (either in the documentation or through Google) how to "Use .reindex or .reindex_like on the result to achieve the same functionality".
Colab notebook illustrating issue: https://colab.research.google.com/drive/13EBq2z0Nh05JY7ovrdnLGtfeqdKVvZq0
Just as the warning suggests, add reindex:
pd.concat([df1,df2.reindex(columns=df1.columns)])
Out[286]:
A B C
1 A1 B1 C1
2 A2 B2 C2
3 NaN B3 C3
4 NaN B4 C4
df1 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']})
df2 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D1', 'D2']})
pd.concat([df1, df2], sort=False)[df1.columns]
yields the desired result.
OR...
pd.concat([df1, df2], sort=False).reindex(df1.columns, axis=1)
Output:
A B C
1 A1 B1 C1
2 A2 B2 C2
3 NaN B3 C3
4 NaN B4 C4
I have two dataframes, say df1 and df2, with the same column names.
Example:
df1
C1 | C2 | C3 | C4
A 1 2 AA
B 1 3 A
A 3 2 B
df2
C1 | C2 | C3 | C4
A 1 3 E
B 1 2 C
Q 4 1 Z
I would like to filter out rows in df1 based on common values in a fixed subset of columns between df1 and df2. In the above example, if the columns are C1 and C2, I would like the first two rows to be filtered out, as their values in both df1 and df2 for these columns are identical.
What would be a clean way to do this in Pandas?
So far, based on this answer, I have been able to find the common rows.
common_df = pandas.merge(df1, df2, how='inner', on=['C1','C2'])
This gives me a new dataframe with only those rows that have common values in the specified columns, i.e., the intersection.
I have also seen this thread, but the answers all seem to assume a difference on all the columns.
The expected result for the above example (rows common on specified columns removed):
C1 | C2 | C3 | C4
A 3 2 B
Maybe not the cleanest, but you could add a key column to df1 to check against.
Setting up the datasets
import pandas as pd
df1 = pd.DataFrame({ 'C1': ['A', 'B', 'A'],
'C2': [1, 1, 3],
'C3': [2, 3, 2],
'C4': ['AA', 'A', 'B']})
df2 = pd.DataFrame({ 'C1': ['A', 'B', 'Q'],
'C2': [1, 1, 4],
'C3': [3, 2, 1],
'C4': ['E', 'C', 'Z']})
Add a key, then use your code to find the common rows:
df1['key'] = range(1, len(df1) + 1)
common_df = pd.merge(df1, df2, how='inner', on=['C1','C2'])
df_filter = df1[~df1['key'].isin(common_df['key'])].drop('key', axis=1)
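With the example frames above, df_filter keeps only the row of df1 whose (C1, C2) pair has no match in df2:

df_filter
  C1  C2  C3 C4
2  A   3   2  B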
You can use an anti-join method: do an outer join on the specified columns while recording the source of each row with an indicator column. The only downside is that you have to rename and drop the extra columns after the join.
>>> import pandas as pd
>>> df1 = pd.DataFrame({'C1':['A','B','A'],'C2':[1,1,3],'C3':[2,3,2],'C4':['AA','A','B']})
>>> df2 = pd.DataFrame({'C1':['A','B','Q'],'C2':[1,1,4],'C3':[3,2,1],'C4':['E','C','Z']})
>>> df_merged = df1.merge(df2, on=['C1','C2'], indicator=True, how='outer')
>>> df_merged
C1 C2 C3_x C4_x C3_y C4_y _merge
0 A 1 2.0 AA 3.0 E both
1 B 1 3.0 A 2.0 C both
2 A 3 2.0 B NaN NaN left_only
3 Q 4 NaN NaN 1.0 Z right_only
>>> df1_setdiff = df_merged[df_merged['_merge'] == 'left_only'].rename(columns={'C3_x': 'C3', 'C4_x': 'C4'}).drop(['C3_y', 'C4_y', '_merge'], axis=1)
>>> df1_setdiff
C1 C2 C3 C4
2 A 3 2.0 B
>>> df2_setdiff = df_merged[df_merged['_merge'] == 'right_only'].rename(columns={'C3_y': 'C3', 'C4_y': 'C4'}).drop(['C3_x', 'C4_x', '_merge'], axis=1)
>>> df2_setdiff
C1 C2 C3 C4
3 Q 4 1.0 Z
import pandas as pd
df1 = pd.DataFrame({'C1':['A','B','A'],'C2':[1,1,3],'C3':[2,3,2],'C4':['AA','A','B']})
df2 = pd.DataFrame({'C1':['A','B','Q'],'C2':[1,1,4],'C3':[3,2,1],'C4':['E','C','Z']})
common = pd.merge(df1, df2, on=['C1','C2'])
# Note: isin checks each column independently, so this tests C1 and C2
# membership separately rather than as exact (C1, C2) pairs
R1 = df1[~((df1.C1.isin(common.C1)) & (df1.C2.isin(common.C2)))]
R2 = df2[~((df2.C1.isin(common.C1)) & (df2.C2.isin(common.C2)))]
df1:
C1 C2 C3 C4
0 A 1 2 AA
1 B 1 3 A
2 A 3 2 B
df2:
C1 C2 C3 C4
0 A 1 3 E
1 B 1 2 C
2 Q 4 1 Z
common:
C1 C2 C3_x C4_x C3_y C4_y
0 A 1 2 AA 3 E
1 B 1 3 A 2 C
R1:
C1 C2 C3 C4
2 A 3 2 B
R2:
C1 C2 C3 C4
2 Q 4 1 Z
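A more compact variant of the indicator approach merges only on the key columns, which avoids the rename/drop step afterwards (a sketch, assuming df2 has no duplicate (C1, C2) pairs):

mask = df1.merge(df2[['C1', 'C2']], on=['C1', 'C2'], how='left', indicator=True)['_merge'] == 'left_only'
df1[mask.values]  # .values because merge resets the row index

This again returns only the third row of df1 (A, 3, 2, B).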
This is probably a simple question and I just couldn't find the answer: in a pandas DataFrame like the one below, how can the objects be sorted first alphabetically and then numerically?
START:
import pandas as pd
d ={'col1': ['A1','B2','A10','A7','C4','C2','C22','B4']}
df = pd.DataFrame(data=d)
df
col1
0 A1
1 B2
2 A10
3 A7
4 C4
5 C2
6 C22
7 B4
WHAT I WANT TO GET:
col1
0 A1
1 A7
2 A10
3 B2
4 B4
5 C2
6 C4
7 C22
WHAT I GET:
>>> df.sort_values(by='col1')
col1
0 A1
2 A10
3 A7
1 B2
7 B4
5 C2
6 C22
4 C4
Using Pandas to sort a list is overkill:
lot_file = pd.DataFrame()
lot_file['SPOOL'] = ['A39','B34','A3','B37','A6','B18','A48','B15','A47']
group_lots = lot_file.sort_values(by=['SPOOL'])
group_lots['SPOOL'].tolist()
Output:
['A3', 'A39', 'A47', 'A48', 'A6', 'B15', 'B18', 'B34', 'B37']
Or use sorted:
spool_list = ['A39','B34','A3','B37','A6','B18','A48','B15','A47']
sorted(spool_list)
Output:
['A3', 'A39', 'A47', 'A48', 'A6', 'B15', 'B18', 'B34', 'B37']
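Note that both snippets above still sort lexicographically, so 'A6' lands after 'A48'. To get the alphabetical-then-numerical order the question asks for, one option is a natural-sort key; a minimal sketch using the standard library and the key parameter of sort_values (pandas >= 1.1.0):

import re
import pandas as pd

df = pd.DataFrame({'col1': ['A1', 'B2', 'A10', 'A7', 'C4', 'C2', 'C22', 'B4']})

def natural_key(s):
    # split 'A10' into ('A', 10) so that 'A7' sorts before 'A10'
    m = re.match(r'([A-Za-z]+)(\d+)', s)
    return (m.group(1), int(m.group(2)))

df.sort_values(by='col1', key=lambda col: col.map(natural_key))

Output:

col1
0 A1
3 A7
2 A10
1 B2
7 B4
5 C2
6 C22
4 C4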