I have a dataframe:
lft rel rgt num
0 t3 r3 z2 3
1 t1 r3 x1 9
2 x2 r3 t2 8
3 x4 r1 t2 4
4 t1 r1 z3 1
5 x1 r1 t2 2
6 x2 r2 t4 4
7 z3 r2 t4 5
8 t4 r3 x3 4
9 z1 r2 t3 4
And a reference dictionary:
replacement_dict = {
    'X1': ['x1', 'x2', 'x3', 'x4'],
    'Y1': ['y1', 'y2'],
    'Z1': ['z1', 'z2', 'z3']
}
My goal is to replace all occurrences of replacement_dict['X1'] with 'X1', and then merge the rows together. For example, any instance of 'x1', 'x2', 'x3' or 'x4' will be replaced by 'X1', etc.
I can do this by selecting the rows that contain any of these strings and replacing them with 'X1':
keys = replacement_dict.keys()
for key in keys:
    DF.loc[DF['lft'].isin(replacement_dict[key]), 'lft'] = key
    DF.loc[DF['rgt'].isin(replacement_dict[key]), 'rgt'] = key
giving:
lft rel rgt num
0 t3 r3 Z1 3
1 t1 r3 X1 9
2 X1 r3 t2 8
3 X1 r1 t2 4
4 t1 r1 Z1 1
5 X1 r1 t2 2
6 X1 r2 t4 4
7 Z1 r2 t4 5
8 t4 r3 X1 4
9 Z1 r2 t3 4
Now, if I select all the rows containing 'X1' and merge them, I should end up with:
lft rel rgt num
0 X1 r3 t2 8
1 X1 r1 t2 6
2 X1 r2 t4 4
3 t1 r3 X1 9
4 t4 r3 X1 4
So the three columns ['lft', 'rel', 'rgt'] act as a unique key, while the 'num' column is summed for each group. Row 1 above, ['X1', 'r1', 't2', 6], is the sum of the two rows ['X1', 'r1', 't2', 4] and ['X1', 'r1', 't2', 2].
I can do this easily for a small number of rows, but I am working with a dataframe with 6 million rows and a replacement dictionary with 60,000 keys, and this is taking forever using simple row-wise extraction and replacement.
How can this (specifically the last part) be scaled efficiently? Is there a pandas trick that someone can recommend?
Reverse the replacement_dict mapping and map() this new mapping onto each of the lft and rgt columns to substitute certain values (e.g. x1 -> X1, y2 -> Y1, etc.). Since some values in the lft and rgt columns don't exist in the mapping (e.g. t1, t2, etc.), call fillna() to fill in those values.[1]
You may also stack() the columns whose values need to be replaced (lft and rgt), call map() + fillna(), and unstack() back; but because there are only 2 columns, it may not be worth the trouble in this particular case.
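A minimal sketch of that stack/unstack alternative (using the same reverse_map that is built below):

# stack both columns into one long Series, map, then unstack back into two columns
reverse_map = {v: k for k, li in replacement_dict.items() for v in li}
cols = ['lft', 'rgt']
stacked = df[cols].stack()
df[cols] = stacked.map(reverse_map).fillna(stacked).unstack()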
The second part of the question may be answered by summing num values after grouping by lft, rel and rgt columns; so groupby().sum() should do the trick.
# reverse replacement map
reverse_map = {v : k for k, li in replacement_dict.items() for v in li}
# substitute values in lft column using reverse_map
df['lft'] = df['lft'].map(reverse_map).fillna(df['lft'])
# substitute values in rgt column using reverse_map
df['rgt'] = df['rgt'].map(reverse_map).fillna(df['rgt'])
# sum values in num column by groups
result = df.groupby(['lft', 'rel', 'rgt'], as_index=False)['num'].sum()
[1]: map() + fillna() may perform better for your use case than replace() because, under the hood, map() uses a Cython-optimized take_nd() method that performs particularly well when there are many values to replace, while replace() uses a replace_list() method that loops in Python. So if replacement_dict is particularly large (which it is in your case), the difference in performance will be huge, but if replacement_dict is small, replace() may outperform map().
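If you want to verify this on data shaped like yours, here is a rough benchmark sketch (the sizes below are illustrative, not your actual data):

import timeit
import numpy as np
import pandas as pd

# synthetic frame and mapping; sizes are made up for illustration
rng = np.random.default_rng(0)
values = [f'x{i}' for i in range(10_000)]
df = pd.DataFrame({'lft': rng.choice(values, size=100_000)})
reverse_map = {v: 'X1' for v in values}

print(timeit.timeit(lambda: df['lft'].map(reverse_map).fillna(df['lft']), number=3))
print(timeit.timeit(lambda: df['lft'].replace(reverse_map), number=3))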
If you flip the keys and values of your replacement_dict, things become a lot easier:
new_replacement_dict = {
    v: key
    for key, values in replacement_dict.items()
    for v in values
}
cols = ["lft", "rel", "rgt"]
df[cols] = df[cols].replace(new_replacement_dict)
df.groupby(cols).sum()
Try this; I've commented the steps:
# reverse the dict to dissolve the lists as values
reversed_dict = {v: k for k, val in replacement_dict.items() for v in val}
# replace the values
cols = ['lft', 'rel', 'rgt']
df[cols] = df[cols].replace(reversed_dict)
# filter rows where X1 appears in any column
df_filtered = df[df.eq('X1').any(axis=1)]
# sum the duplicate rows
out = df_filtered.groupby(cols).sum().reset_index()
print(out)
Output:
lft rel rgt num
0 X1 r1 t2 6
1 X1 r2 t4 4
2 X1 r3 t2 8
3 t1 r3 X1 9
4 t4 r3 X1 4
Pandas has a built-in function replace() that is faster than going through the whole dataframe with .loc. You can also pass it a list of values to replace, which makes our dictionary a good fit for it:
keys = replacement_dict.keys()
# loop through every key in our dictionary and apply its replacements
for key in keys:
    DF = DF.replace(to_replace=replacement_dict[key], value=key)
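Note on the design choice: flattening the dictionary first lets a single replace() call do all the work, avoiding the Python-level loop over keys (a sketch along the lines of the other answers here):

# one replace() call with a flattened value -> key mapping
flat_map = {v: k for k, vals in replacement_dict.items() for v in vals}
DF = DF.replace(flat_map)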
Here's a way to do what your question asks:
df[['lft','rgt']] = (df[['lft','rgt']]
                     .replace({it: k for k, v in replacement_dict.items() for it in v}))
df = (df[(df.lft == 'X1') | (df.rgt == 'X1')]
      .groupby(['lft','rel','rgt']).sum().reset_index())
Output:
lft rel rgt num
0 X1 r1 t2 6
1 X1 r2 t4 4
2 X1 r3 t2 8
3 t1 r3 X1 9
4 t4 r3 X1 4
Explanation:
replace() uses a reversed version of the dictionary to replace items from the lists in the original dict with the corresponding keys, in the relevant df columns lft and rgt.
After filtering for rows with 'X1' in either lft or rgt, groupby(), sum() and reset_index() sum the num column for each unique (lft, rel, rgt) group key and restore the group keys from index levels back to columns.
As an alternative, we can use query() to select only rows containing 'X1':
df[['lft','rgt']] = (df[['lft','rgt']]
                     .replace({it: k for k, v in replacement_dict.items() for it in v}))
df = (df.query("lft=='X1' or rgt=='X1'")
      .groupby(['lft','rel','rgt']).sum().reset_index())
Lots of great answers here. I avoid the need for the dict and use df.apply() like this to generate the new data:
import io
import pandas as pd
# create the data
x = '''
lft rel rgt num
t3 r3 z2 3
t1 r3 x1 9
x2 r3 t2 8
x4 r1 t2 4
t1 r1 z3 1
x1 r1 t2 2
x2 r2 t4 4
z3 r2 t4 5
t4 r3 x3 4
z1 r2 t3 4
'''
data = io.StringIO(x)
df = pd.read_csv(data, sep=' ')
print(df)
replacement_dict = {
    'X1': ['x1', 'x2', 'x3', 'x4'],
    'Y1': ['y1', 'y2'],
    'Z1': ['z1', 'z2', 'z3']
}
def replace(x):
    # derive which key to check: first character, plus '1', upper-cased
    key_check = x[0] + '1'
    key_check = key_check.upper()
    return key_check
df['new'] = df['lft'].apply(replace)
df
returning this:
lft rel rgt num
0 t3 r3 z2 3
1 t1 r3 x1 9
2 x2 r3 t2 8
3 x4 r1 t2 4
4 t1 r1 z3 1
5 x1 r1 t2 2
6 x2 r2 t4 4
7 z3 r2 t4 5
8 t4 r3 x3 4
9 z1 r2 t3 4
lft rel rgt num new
0 t3 r3 z2 3 T1
1 t1 r3 x1 9 T1
2 x2 r3 t2 8 X1
3 x4 r1 t2 4 X1
4 t1 r1 z3 1 T1
5 x1 r1 t2 2 X1
6 x2 r2 t4 4 X1
7 z3 r2 t4 5 Z1
8 t4 r3 x3 4 T1
9 z1 r2 t3 4 Z1
I am reading multiple JSON objects into one DataFrame. The problem is that some of the columns are lists. Also, the data is very big, so I cannot use the available solutions on the internet: they are very slow and memory-inefficient.
Here is how my data looks like:
df = pd.DataFrame({'A': ['x1','x2','x3', 'x4'], 'B':[['v1','v2'],['v3','v4'],['v5','v6'],['v7','v8']], 'C':[['c1','c2'],['c3','c4'],['c5','c6'],['c7','c8']],'D':[['d1','d2'],['d3','d4'],['d5','d6'],['d7','d8']], 'E':[['e1','e2'],['e3','e4'],['e5','e6'],['e7','e8']]})
A B C D E
0 x1 [v1, v2] [c1, c2] [d1, d2] [e1, e2]
1 x2 [v3, v4] [c3, c4] [d3, d4] [e3, e4]
2 x3 [v5, v6] [c5, c6] [d5, d6] [e5, e6]
3 x4 [v7, v8] [c7, c8] [d7, d8] [e7, e8]
And this is the shape of my data: (441079, 12)
My desired output is:
A B C D E
0 x1 v1 c1 d1 e1
0 x1 v2 c2 d2 e2
1 x2 v3 c3 d3 e3
1 x2 v4 c4 d4 e4
.....
EDIT: After this was marked as a duplicate, I would like to stress that in this question I was looking for an efficient method of exploding multiple columns. The accepted answer explodes an arbitrary number of columns on very large datasets efficiently, something the answers to the other question failed to do (which is why I asked this question after testing those solutions).
pandas >= 0.25
Assuming all columns have the same number of lists, you can call Series.explode on each column.
df.set_index(['A']).apply(pd.Series.explode).reset_index()
A B C D E
0 x1 v1 c1 d1 e1
1 x1 v2 c2 d2 e2
2 x2 v3 c3 d3 e3
3 x2 v4 c4 d4 e4
4 x3 v5 c5 d5 e5
5 x3 v6 c6 d6 e6
6 x4 v7 c7 d7 e7
7 x4 v8 c8 d8 e8
The idea is to set as the index all columns that must NOT be exploded first, then reset the index after.
It's also faster.
%timeit df.set_index(['A']).apply(pd.Series.explode).reset_index()

%%timeit
(df.set_index('A')
   .apply(lambda x: x.apply(pd.Series).stack())
   .reset_index()
   .drop('level_1', axis=1))
2.22 ms ± 98.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
9.14 ms ± 329 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Use set_index on A, then apply and stack the values of the remaining columns. All of this condensed into a one-liner.
In [1253]: (df.set_index('A')
              .apply(lambda x: x.apply(pd.Series).stack())
              .reset_index()
              .drop('level_1', axis=1))
Out[1253]:
A B C D E
0 x1 v1 c1 d1 e1
1 x1 v2 c2 d2 e2
2 x2 v3 c3 d3 e3
3 x2 v4 c4 d4 e4
4 x3 v5 c5 d5 e5
5 x3 v6 c6 d6 e6
6 x4 v7 c7 d7 e7
7 x4 v8 c8 d8 e8
import numpy as np
import pandas as pd

def explode(df, lst_cols, fill_value=''):
    # make sure `lst_cols` is a list
    if lst_cols and not isinstance(lst_cols, list):
        lst_cols = [lst_cols]
    # all columns except `lst_cols`
    idx_cols = df.columns.difference(lst_cols)
    # calculate lengths of lists
    lens = df[lst_cols[0]].str.len()
    if (lens > 0).all():
        # ALL lists in cells are non-empty
        return pd.DataFrame({
            col: np.repeat(df[col].values, df[lst_cols[0]].str.len())
            for col in idx_cols
        }).assign(**{col: np.concatenate(df[col].values) for col in lst_cols}) \
          .loc[:, df.columns]
    else:
        # at least one list in cells is empty
        return pd.DataFrame({
            col: np.repeat(df[col].values, df[lst_cols[0]].str.len())
            for col in idx_cols
        }).assign(**{col: np.concatenate(df[col].values) for col in lst_cols}) \
          .append(df.loc[lens == 0, idx_cols]).fillna(fill_value) \
          .loc[:, df.columns]
Usage:
In [82]: explode(df, lst_cols=list('BCDE'))
Out[82]:
A B C D E
0 x1 v1 c1 d1 e1
1 x1 v2 c2 d2 e2
2 x2 v3 c3 d3 e3
3 x2 v4 c4 d4 e4
4 x3 v5 c5 d5 e5
5 x3 v6 c6 d6 e6
6 x4 v7 c7 d7 e7
7 x4 v8 c8 d8 e8
Building on @cs95's answer, we can use an if clause in the lambda function instead of setting all the other columns as the index. This has the following advantages:
It preserves column order.
It lets you easily specify the columns to modify (x.name in [...]) or to leave alone (x.name not in [...]).
df.apply(lambda x: x.explode() if x.name in ['B', 'C', 'D', 'E'] else x)
A B C D E
0 x1 v1 c1 d1 e1
0 x1 v2 c2 d2 e2
1 x2 v3 c3 d3 e3
1 x2 v4 c4 d4 e4
2 x3 v5 c5 d5 e5
2 x3 v6 c6 d6 e6
3 x4 v7 c7 d7 e7
3 x4 v8 c8 d8 e8
As of pandas 1.3.0 (What’s new in 1.3.0 (July 2, 2021)):
DataFrame.explode() now supports exploding multiple columns. Its column argument now also accepts a list of str or tuples for exploding on multiple columns at the same time (GH39240)
So now this operation is as simple as:
df.explode(['B', 'C', 'D', 'E'])
A B C D E
0 x1 v1 c1 d1 e1
0 x1 v2 c2 d2 e2
1 x2 v3 c3 d3 e3
1 x2 v4 c4 d4 e4
2 x3 v5 c5 d5 e5
2 x3 v6 c6 d6 e6
3 x4 v7 c7 d7 e7
3 x4 v8 c8 d8 e8
Or if wanting unique indexing:
df.explode(['B', 'C', 'D', 'E'], ignore_index=True)
A B C D E
0 x1 v1 c1 d1 e1
1 x1 v2 c2 d2 e2
2 x2 v3 c3 d3 e3
3 x2 v4 c4 d4 e4
4 x3 v5 c5 d5 e5
5 x3 v6 c6 d6 e6
6 x4 v7 c7 d7 e7
7 x4 v8 c8 d8 e8
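One caveat worth knowing: exploding multiple columns requires the per-row lists to have matching lengths; otherwise pandas raises a ValueError. A small sketch (the exact error text may vary by version):

# per-row list lengths must match across the exploded columns
bad = pd.DataFrame({'A': ['x1'], 'B': [['v1', 'v2']], 'C': [['c1']]})
try:
    bad.explode(['B', 'C'])
except ValueError as err:
    print(err)  # e.g. "columns must have matching element counts"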
Gathering all of the responses on this and other threads, here is how I do it for comma-delimited rows:
from collections.abc import Sequence
import pandas as pd
import numpy as np

def explode_by_delimiter(
    df: pd.DataFrame,
    columns: str | Sequence[str],
    delimiter: str = ",",
    reindex: bool = True
) -> pd.DataFrame:
    """Convert dataframe with columns separated by a delimiter into an
    ordinary dataframe. Requires pandas 1.3.0+."""
    if isinstance(columns, str):
        columns = [columns]
    col_dict = {
        col: df[col]
        .str.split(delimiter)
        # Without .fillna(), .explode() will fail on empty values
        .fillna({i: [np.nan] for i in df.index})
        for col in columns
    }
    df = df.assign(**col_dict).explode(columns)
    return df.reset_index(drop=True) if reindex else df
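A minimal usage sketch (made-up data; pandas 1.3.0+ is assumed for the multi-column explode):

df = pd.DataFrame({'id': [1, 2], 'tags': ['a,b', 'c'], 'vals': ['1,2', '3']})
print(explode_by_delimiter(df, ['tags', 'vals']))
#    id tags vals
# 0   1    a    1
# 1   1    b    2
# 2   2    c    3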
Here is my solution using the apply() function. Its main features/differences:
It offers the option to apply to selected multiple columns, or to all columns.
It offers options for the values to fill in the 'missing' positions (through the parameter fill_mode = 'external', 'internal', or 'trim'; the explanation would be long, so see the examples below and try changing the option yourself to check the result).
Note: the option 'trim' was developed for my own needs and is out of scope for this question.
def lenx(x):
    return len(x) if isinstance(x, (list, tuple, np.ndarray, pd.Series)) else 1

def cell_size_equalize2(row, cols='', fill_mode='internal', fill_value=''):
    jcols = [j for j, v in enumerate(row.index) if v in cols]
    if len(jcols) < 1:
        jcols = range(len(row.index))
    Ls = [lenx(x) for x in row.values]
    if not Ls[:-1] == Ls[1:]:
        vals = [v if isinstance(v, list) else [v] for v in row.values]
        if fill_mode == 'external':
            vals = [[e] + [fill_value] * (max(Ls) - 1) if (j not in jcols) and isinstance(row.values[j], list)
                    else e + [fill_value] * (max(Ls) - lenx(e))
                    for j, e in enumerate(vals)]
        elif fill_mode == 'internal':
            vals = [[e] + [e] * (max(Ls) - 1) if (j not in jcols) and isinstance(row.values[j], list)
                    else e + [e[-1]] * (max(Ls) - lenx(e))
                    for j, e in enumerate(vals)]
        else:
            vals = [e[0:min(Ls)] for e in vals]
        row = pd.Series(vals, index=row.index.tolist())
    return row
Examples:
df = pd.DataFrame({
    'a': [[1], 2, 3],
    'b': [[4, 5, 7], [5, 4], 4],
    'c': [[4, 5], 5, [6]]
})
print(df)
df1 = df.apply(cell_size_equalize2, cols='', fill_mode='external', fill_value = "OK", axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'external\', all columns, fill_value = \'OK\'\n', df1)
df2 = df.apply(cell_size_equalize2, cols=['a', 'b'], fill_mode='external', fill_value = "OK", axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'external\', cols = [\'a\', \'b\'], fill_value = \'OK\'\n', df2)
df3 = df.apply(cell_size_equalize2, cols=['a', 'b'], fill_mode='internal', axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'internal\', cols = [\'a\', \'b\']\n', df3)
df4 = df.apply(cell_size_equalize2, cols='', fill_mode='trim', axis=1).apply(pd.Series.explode)
print('\nfill_mode=\'trim\', all columns\n', df4)
Output:
a b c
0 [1] [4, 5, 7] [4, 5]
1 2 [5, 4] 5
2 3 4 [6]
fill_mode='external', all columns, fill_value = 'OK'
a b c
0 1 4 4
0 OK 5 5
0 OK 7 OK
1 2 5 5
1 OK 4 OK
2 3 4 6
fill_mode='external', cols = ['a', 'b'], fill_value = 'OK'
a b c
0 1 4 [4, 5]
0 OK 5 OK
0 OK 7 OK
1 2 5 5
1 OK 4 OK
2 3 4 6
fill_mode='internal', cols = ['a', 'b']
a b c
0 1 4 [4, 5]
0 1 5 [4, 5]
0 1 7 [4, 5]
1 2 5 5
1 2 4 5
2 3 4 6
fill_mode='trim', all columns
a b c
0 1 4 4
1 2 5 5
2 3 4 6
Let's say we have a df like below:
df = pd.DataFrame({'A':['y2','x3','z1','z1'],'B':['y2','x3','a2','z1']})
A B
0 y2 y2
1 x3 x3
2 z1 a2
3 z1 z1
if we wanted to sort the values on just the numbers in column A, we can do:
df.sort_values(by='A',key=lambda x: x.str[1])
A B
3 z1 z1
2 z1 a2
0 y2 y2
1 x3 x3
If we wanted to sort by both columns A and B, but have the key only apply to column A, is there a way to do that?
df.sort_values(by=['A','B'],key=lambda x: x.str[1])
Expected output:
A B
2 z1 a2
3 z1 z1
0 y2 y2
1 x3 x3
You can sort by B first, then sort by A with a stable method (mergesort is stable, so rows that tie on A's key keep their previous ordering by B):
(df.sort_values('B')
   .sort_values('A', key=lambda x: x.str[1], kind='mergesort')
)
Output:
A B
2 z1 a2
3 z1 z1
0 y2 y2
1 x3 x3
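An equivalent single-pass alternative (a sketch, not part of the answer above) is to materialize the key as a temporary column, sort on both columns at once, then drop it:

# _key is a hypothetical helper column holding the digit of A
out = (df.assign(_key=df['A'].str[1])
         .sort_values(['_key', 'B'])
         .drop(columns='_key'))
print(out)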
I happen to have the following DataFrame:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Prod1': ['10', '', '10', '', '', ''],
                   'Prod2': ['', '5', '5', '', '', '5'],
                   'Prod3': ['', '', '', '8', '8', '8'],
                   'String1': ['', '', '', '', '', ''],
                   'String2': ['', '', '', '', '', ''],
                   'String3': ['', '', '', '', '', ''],
                   'X1': ['x1', 'x2', 'x3', 'x4', 'x5', 'x6'],
                   'X2': ['', '', 'y1', '', '', 'y2']})
print(df)
  Prod1 Prod2 Prod3 String1 String2 String3  X1  X2
0    10                                      x1
1           5                                x2
2    10     5                                x3  y1
3                 8                          x4
4                 8                          x5
5           5     8                          x6  y2
It's a schematic table of Products with associated Strings; the actual Strings are in columns (X1, X2), but they should eventually move to (String1, String2, String3) based on whether the corresponding product has a value or not.
For instance:
row 0 has a value on Prod1, hence x1 should move to String1.
row 1 has a value on Prod2, hence x2 should move to String2.
In the actual dataset each Prod mostly has a single String, but in some rows multiple Prods have values; in that case the String columns should be filled giving priority to the left. The final result should look like:
  Prod1 Prod2 Prod3 String1 String2 String3  X1  X2
0    10                  x1                  x1
1           5                    x2          x2
2    10     5            x3      y1          x3  y1
3                 8                      x4  x4
4                 8                      x5  x5
5           5     8              x6      y2  x6  y2
I was thinking about nested column/row loops, but I'm still not familiar enough with pandas to get to the solution.
Thank you very much in advance for any suggestion!
I'll break down the steps:

df[['String1', 'String2', 'String3']] = (df[['Prod1', 'Prod2', 'Prod3']] != '')  # boolean mask: which Prod columns have a value
df1 = df[['String1', 'String2', 'String3']].replace({False: np.nan}).stack().to_frame()  # stack() drops the NaN slots, keeping one row per True
df1[0] = df[['X1', 'X2']].replace({'': np.nan}).stack().values  # align the stacked X values with those slots
df[['String1', 'String2', 'String3']] = df1[0].unstack()  # pivot back to the String columns
df.replace({None: ''})
Out[1036]:
  Prod1 Prod2 Prod3 String1 String2 String3  X1  X2
0    10                  x1                  x1
1           5                    x2          x2
2    10     5            x3      y1          x3  y1
3                 8                      x4  x4
4                 8                      x5  x5
5           5     8              x6      y2  x6  y2
Need some help with data aggregation in Python. I have a DataFrame with 3 columns and N rows. The first two columns contain indices (call them X and Y); the last one contains values. The task is to calculate the sum() of the values in the third column corresponding to each pair (x_i, y_j) and write it to a new DataFrame at the intersection of (x_i, y_j). Or, simpler, transform:
ind1 ind2 value
x1 y1 k1
x2 y1 k2
x3 y1 k3
x1 y2 k4
x2 y2 k5
x3 y2 k6
into a kind of 2D array:
y1 y2
________
x1 |k1 k4
x2 |k2 k5
x3 |k3 k6
I've tried pandas.groupby but didn't find a proper solution. So, what should I do?
You want to pivot your data. Example:
In [5]: data = {'ind1': ['x1','x2','x3','x1','x2','x3'],
                'ind2': ['y1','y1','y1','y2','y2','y2'],
                'value': ['k1','k2','k3','k4','k5','k6']}

In [6]: df = pd.DataFrame(data=data); df
Out[6]:
ind1 ind2 value
0 x1 y1 k1
1 x2 y1 k2
2 x3 y1 k3
3 x1 y2 k4
4 x2 y2 k5
5 x3 y2 k6
In [9]: df.pivot(index='ind1', columns='ind2', values='value')
Out[9]:
ind2 y1 y2
ind1
x1 k1 k4
x2 k2 k5
x3 k3 k6
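One caveat: pivot() requires each (ind1, ind2) pair to appear only once, and it doesn't aggregate. If pairs can repeat and you need the sum the question asks for, pivot_table() with an aggregation function does both steps; a sketch with made-up numeric values (summing requires numbers, while the example above uses strings):

# pivot_table() aggregates the duplicate (ind1, ind2) pairs while pivoting
df_num = pd.DataFrame({'ind1': ['x1', 'x1', 'x2'],
                       'ind2': ['y1', 'y1', 'y2'],
                       'value': [1, 2, 3]})
print(df_num.pivot_table(index='ind1', columns='ind2',
                         values='value', aggfunc='sum'))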
You can find more information here: http://pandas.pydata.org/pandas-docs/stable/reshaping.html