I want to apply my custom function (it uses an if-else ladder) to these six columns (ERI_Hispanic, ERI_AmerInd_AKNatv, ERI_Asian, ERI_Black_Afr.Amer, ERI_HI_PacIsl, ERI_White) in each row of my dataframe.
I've tried different methods from other questions but still can't seem to find the right answer for my problem. The critical piece of this is that if the person is counted as Hispanic they can't be counted as anything else. Even if they have a "1" in another ethnicity column they still are counted as Hispanic not two or more races. Similarly, if the sum of all the ERI columns is greater than 1 they are counted as two or more races and can't be counted as a unique ethnicity(except for Hispanic).
It's almost like doing a for loop through each row and if each record meets a criterion they are added to one list and eliminated from the original.
From the dataframe below I need to calculate a new column based on the following spec in SQL:
CRITERIA
IF [ERI_Hispanic] = 1 THEN RETURN “Hispanic”
ELSE IF SUM([ERI_AmerInd_AKNatv] + [ERI_Asian] + [ERI_Black_Afr.Amer] + [ERI_HI_PacIsl] + [ERI_White]) > 1 THEN RETURN “Two or More”
ELSE IF [ERI_AmerInd_AKNatv] = 1 THEN RETURN “A/I AK Native”
ELSE IF [ERI_Asian] = 1 THEN RETURN “Asian”
ELSE IF [ERI_Black_Afr.Amer] = 1 THEN RETURN “Black/AA”
ELSE IF [ERI_HI_PacIsl] = 1 THEN RETURN “Haw/Pac Isl.”
ELSE IF [ERI_White] = 1 THEN RETURN “White”
Comment: If the ERI Flag for Hispanic is True (1), the employee is classified as “Hispanic”
Comment: If more than 1 non-Hispanic ERI Flag is true, return “Two or More”
DATAFRAME
lname fname rno_cd eri_afr_amer eri_asian eri_hawaiian eri_hispanic eri_nat_amer eri_white rno_defined
0 MOST JEFF E 0 0 0 0 0 1 White
1 CRUISE TOM E 0 0 0 1 0 0 White
2 DEPP JOHNNY 0 0 0 0 0 1 Unknown
3 DICAP LEO 0 0 0 0 0 1 Unknown
4 BRANDO MARLON E 0 0 0 0 0 0 White
5 HANKS TOM 0 0 0 0 0 1 Unknown
6 DENIRO ROBERT E 0 1 0 0 0 1 White
7 PACINO AL E 0 0 0 0 0 1 White
8 WILLIAMS ROBIN E 0 0 1 0 0 0 White
9 EASTWOOD CLINT E 0 0 0 0 0 1 White
OK, two steps to this - first is to write a function that does the translation you want - I've put an example together based on your pseudo-code:
def label_race (row):
if row['eri_hispanic'] == 1 :
return 'Hispanic'
if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
return 'Two Or More'
if row['eri_nat_amer'] == 1 :
return 'A/I AK Native'
if row['eri_asian'] == 1:
return 'Asian'
if row['eri_afr_amer'] == 1:
return 'Black/AA'
if row['eri_hawaiian'] == 1:
return 'Haw/Pac Isl.'
if row['eri_white'] == 1:
return 'White'
return 'Other'
You may want to go over this, but it seems to do the trick - notice that the parameter going into the function is considered to be a Series object labelled "row".
Next, use the apply function in pandas to apply the function - e.g.
df.apply (lambda row: label_race(row), axis=1)
Note the axis=1 specifier, that means that the application is done at a row, rather than a column level. The results are here:
0 White
1 Hispanic
2 White
3 White
4 Other
5 White
6 Two Or More
7 White
8 Haw/Pac Isl.
9 White
If you're happy with those results, then run it again, saving the results into a new column in your original dataframe.
df['race_label'] = df.apply (lambda row: label_race(row), axis=1)
The resultant dataframe looks like this (scroll to the right to see the new column):
lname fname rno_cd eri_afr_amer eri_asian eri_hawaiian eri_hispanic eri_nat_amer eri_white rno_defined race_label
0 MOST JEFF E 0 0 0 0 0 1 White White
1 CRUISE TOM E 0 0 0 1 0 0 White Hispanic
2 DEPP JOHNNY NaN 0 0 0 0 0 1 Unknown White
3 DICAP LEO NaN 0 0 0 0 0 1 Unknown White
4 BRANDO MARLON E 0 0 0 0 0 0 White Other
5 HANKS TOM NaN 0 0 0 0 0 1 Unknown White
6 DENIRO ROBERT E 0 1 0 0 0 1 White Two Or More
7 PACINO AL E 0 0 0 0 0 1 White White
8 WILLIAMS ROBIN E 0 0 1 0 0 0 White Haw/Pac Isl.
9 EASTWOOD CLINT E 0 0 0 0 0 1 White White
Since this is the first Google result for 'pandas new column from others', here's a simple example:
import pandas as pd
# make a simple dataframe
df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
df
# a b
# 0 1 3
# 1 2 4
# create an unattached column with an index
df.apply(lambda row: row.a + row.b, axis=1)
# 0 4
# 1 6
# do same but attach it to the dataframe
df['c'] = df.apply(lambda row: row.a + row.b, axis=1)
df
# a b c
# 0 1 3 4
# 1 2 4 6
If you get the SettingWithCopyWarning you can do it this way also:
fn = lambda row: row.a + row.b # define a function for the new column
col = df.apply(fn, axis=1) # get column data with an index
df = df.assign(c=col.values) # assign values to column 'c'
Source: https://stackoverflow.com/a/12555510/243392
And if your column name includes spaces you can use syntax like this:
df = df.assign(**{'some column name': col.values})
And here's the documentation for apply, and assign.
The answers above are perfectly valid, but a vectorized solution exists, in the form of numpy.select. This allows you to define conditions, then define outputs for those conditions, much more efficiently than using apply:
First, define conditions:
conditions = [
df['eri_hispanic'] == 1,
df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
df['eri_nat_amer'] == 1,
df['eri_asian'] == 1,
df['eri_afr_amer'] == 1,
df['eri_hawaiian'] == 1,
df['eri_white'] == 1,
]
Now, define the corresponding outputs:
outputs = [
'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
]
Finally, using numpy.select:
res = np.select(conditions, outputs, 'Other')
pd.Series(res)
0 White
1 Hispanic
2 White
3 White
4 Other
5 White
6 Two Or More
7 White
8 Haw/Pac Isl.
9 White
dtype: object
Why should numpy.select be used over apply? Here are some performance checks:
df = pd.concat([df]*1000)
In [42]: %timeit df.apply(lambda row: label_race(row), axis=1)
1.07 s ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [44]: %%timeit
...: conditions = [
...: df['eri_hispanic'] == 1,
...: df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
...: df['eri_nat_amer'] == 1,
...: df['eri_asian'] == 1,
...: df['eri_afr_amer'] == 1,
...: df['eri_hawaiian'] == 1,
...: df['eri_white'] == 1,
...: ]
...:
...: outputs = [
...: 'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
...: ]
...:
...: np.select(conditions, outputs, 'Other')
...:
...:
3.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using numpy.select gives us vastly improved performance, and the discrepancy will only increase as the data grows.
.apply() takes in a function as the first parameter; pass in the label_race function as so:
df['race_label'] = df.apply(label_race, axis=1)
You don't need to make a lambda function to pass in a function.
try this,
df.loc[df['eri_white']==1,'race_label'] = 'White'
df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
df['race_label'].fillna('Other', inplace=True)
O/P:
lname fname rno_cd eri_afr_amer eri_asian eri_hawaiian \
0 MOST JEFF E 0 0 0
1 CRUISE TOM E 0 0 0
2 DEPP JOHNNY NaN 0 0 0
3 DICAP LEO NaN 0 0 0
4 BRANDO MARLON E 0 0 0
5 HANKS TOM NaN 0 0 0
6 DENIRO ROBERT E 0 1 0
7 PACINO AL E 0 0 0
8 WILLIAMS ROBIN E 0 0 1
9 EASTWOOD CLINT E 0 0 0
eri_hispanic eri_nat_amer eri_white rno_defined race_label
0 0 0 1 White White
1 1 0 0 White Hispanic
2 0 0 1 Unknown White
3 0 0 1 Unknown White
4 0 0 0 White Other
5 0 0 1 Unknown White
6 0 0 1 White Two Or More
7 0 0 1 White White
8 0 0 0 White Haw/Pac Isl.
9 0 0 1 White White
use .loc instead of apply.
it improves vectorization.
.loc works in simple manner, mask rows based on the condition, apply values to the freeze rows.
for more details visit, .loc docs
Performance metrics:
Accepted Answer:
def label_race (row):
if row['eri_hispanic'] == 1 :
return 'Hispanic'
if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
return 'Two Or More'
if row['eri_nat_amer'] == 1 :
return 'A/I AK Native'
if row['eri_asian'] == 1:
return 'Asian'
if row['eri_afr_amer'] == 1:
return 'Black/AA'
if row['eri_hawaiian'] == 1:
return 'Haw/Pac Isl.'
if row['eri_white'] == 1:
return 'White'
return 'Other'
df=pd.read_csv('dataser.csv')
df = pd.concat([df]*1000)
%timeit df.apply(lambda row: label_race(row), axis=1)
1.15 s ± 46.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
My Proposed Answer:
def label_race(df):
df.loc[df['eri_white']==1,'race_label'] = 'White'
df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
df['race_label'].fillna('Other', inplace=True)
df=pd.read_csv('s22.csv')
df = pd.concat([df]*1000)
%timeit label_race(df)
24.7 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If we inspect its source code, apply() is a syntactic sugar for a Python for-loop (via the apply_series_generator() method of the FrameApply class). Because it has the pandas overhead, it's generally slower than a Python loop.
Use optimized (vectorized) methods wherever possible. If you have to use a loop, use #numba.jit decorator.
1. Don't use apply() for an if-else ladder
df.apply() is just about the slowest way to do this in pandas. As shown in the answers of user3483203 and Mohamed Thasin ah, depending on the dataframe size, np.select() and df.loc may be 50-300 times faster than df.apply() to produce the same output.
As it happens, a loop implementation (not unlike apply()) with the #jit decorator from numba module is (about 50-60%) faster than df.loc and np.select.1
Numba works on numpy arrays, so before using the jit decorator, you need to convert the dataframe into a numpy array. Then fill in values in a pre-initialized empty array by checking the conditions in a loop. Since numpy arrays don't have column names, you have to access the columns by their index in the loop. The most inconvenient part of the if-else ladder in the jitted function over the one in apply() is accessing the columns by their indices. Otherwise it's almost the same implementation.
import numpy as np
import numba as nb
#nb.jit(nopython=True)
def conditional_assignment(arr, res):
length = len(arr)
for i in range(length):
if arr[i][3] == 1:
res[i] = 'Hispanic'
elif arr[i][0] + arr[i][1] + arr[i][2] + arr[i][4] + arr[i][5] > 1:
res[i] = 'Two Or More'
elif arr[i][0] == 1:
res[i] = 'Black/AA'
elif arr[i][1] == 1:
res[i] = 'Asian'
elif arr[i][2] == 1:
res[i] = 'Haw/Pac Isl.'
elif arr[i][4] == 1:
res[i] = 'A/I AK Native'
elif arr[i][5] == 1:
res[i] = 'White'
else:
res[i] = 'Other'
return res
# the columns with the boolean data
cols = [c for c in df.columns if c.startswith('eri_')]
# initialize an empty array to be filled in a loop
# for string dtype arrays, we need to know the length of the longest string
# and use it to set the dtype
res = np.empty(len(df), dtype=f"<U{len('A/I AK Native')}")
# pass the underlying numpy array of `df[cols]` into the jitted function
df['rno_defined'] = conditional_assignment(df[cols].values, res)
2. Don't use apply() for numeric operations
If you need to add a new row by adding two columns, your first instinct may be to write
df['c'] = df.apply(lambda row: row['a'] + row['b'], axis=1)
But instead of this, row-wise add using sum(axis=1) method (or + operator if there are only a couple of columns):
df['c'] = df[['a','b']].sum(axis=1)
# equivalently
df['c'] = df['a'] + df['b']
Depending on the dataframe size, sum(1) may be 100s of times faster than apply().
In fact, you will almost never need apply() for numeric operations on a pandas dataframe because it has optimized methods for most operations: addition (sum(1)), subtraction (sub() or diff()), multiplication (prod(1)), division (div() or /), power (pow()), >, >=, ==, %, //, &, | etc. can all be performed on the entire dataframe without apply().
For example, let's say you want to create a new column using the following rule:
IF [colC] > 0 THEN RETURN [colA] * [colB]
ELSE RETURN [colA] / [colB]
Using the optimized pandas methods, this can be written as
df['new'] = df[['colA','colB']].prod(1).where(df['colC']>0, df['colA'] / df['colB'])
the equivalent apply() solution is:
df['new'] = df.apply(lambda row: row.colA * row.colB if row.colC > 0 else row.colA / row.colB, axis=1)
The approach using the optimized methods is 250 times faster than the equivalent apply() approach for dataframes with 20k rows. This gap only increases as the data size increases (for a dataframe with 1 mil rows, it's 365 times faster) and the time difference will become more and more noticeable.2
1: In the below result, I show the performance of the three approaches using a dataframe with 24 mil rows (this is the largest frame I can construct on my machine). For smaller frames, the numba-jitted function consistently runs at least 50% faster than the other two as well (you can check yourself).
def pd_loc(df):
df['rno_defined'] = 'Other'
df.loc[df['eri_nat_amer'] == 1, 'rno_defined'] = 'A/I AK Native'
df.loc[df['eri_asian'] == 1, 'rno_defined'] = 'Asian'
df.loc[df['eri_afr_amer'] == 1, 'rno_defined'] = 'Black/AA'
df.loc[df['eri_hawaiian'] == 1, 'rno_defined'] = 'Haw/Pac Isl.'
df.loc[df['eri_white'] == 1, 'rno_defined'] = 'White'
df.loc[df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1) > 1, 'rno_defined'] = 'Two Or More'
df.loc[df['eri_hispanic'] == 1, 'rno_defined'] = 'Hispanic'
return df
def np_select(df):
conditions = [df['eri_hispanic'] == 1,
df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
df['eri_nat_amer'] == 1,
df['eri_asian'] == 1,
df['eri_afr_amer'] == 1,
df['eri_hawaiian'] == 1,
df['eri_white'] == 1]
outputs = ['Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White']
df['rno_defined'] = np.select(conditions, outputs, 'Other')
return df
#nb.jit(nopython=True)
def conditional_assignment(arr, res):
length = len(arr)
for i in range(length):
if arr[i][3] == 1 :
res[i] = 'Hispanic'
elif arr[i][0] + arr[i][1] + arr[i][2] + arr[i][4] + arr[i][5] > 1 :
res[i] = 'Two Or More'
elif arr[i][0] == 1:
res[i] = 'Black/AA'
elif arr[i][1] == 1:
res[i] = 'Asian'
elif arr[i][2] == 1:
res[i] = 'Haw/Pac Isl.'
elif arr[i][4] == 1 :
res[i] = 'A/I AK Native'
elif arr[i][5] == 1:
res[i] = 'White'
else:
res[i] = 'Other'
return res
def nb_loop(df):
cols = [c for c in df.columns if c.startswith('eri_')]
res = np.empty(len(df), dtype=f"<U{len('A/I AK Native')}")
df['rno_defined'] = conditional_assignment(df[cols].values, res)
return df
# df with 24mil rows
n = 4_000_000
df = pd.DataFrame({
'eri_afr_amer': [0, 0, 0, 0, 0, 0]*n,
'eri_asian': [1, 0, 0, 0, 0, 0]*n,
'eri_hawaiian': [0, 0, 0, 1, 0, 0]*n,
'eri_hispanic': [0, 1, 0, 0, 1, 0]*n,
'eri_nat_amer': [0, 0, 0, 0, 1, 0]*n,
'eri_white': [0, 0, 1, 1, 0, 0]*n
}, dtype='int8')
df.insert(0, 'name', ['MOST', 'CRUISE', 'DEPP', 'DICAP', 'BRANDO', 'HANKS']*n)
%timeit nb_loop(df)
# 5.23 s ± 45.2 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
%timeit pd_loc(df)
# 7.97 s ± 28.8 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
%timeit np_select(df)
# 8.5 s ± 39.6 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
2: In the below result, I show the performance of the two approaches using a dataframe with 20k rows and again with 1 mil rows. For smaller frames, the gap is smaller because the optimized approach has an overhead while apply() is a loop. As the size of the frame increases, the vectorization overhead cost diminishes w.r.t. to the overall runtime of the code while apply() remains a loop over the frame.
n = 20_000 # 1_000_000
df = pd.DataFrame(np.random.rand(n,3)-0.5, columns=['colA','colB','colC'])
%timeit df[['colA','colB']].prod(1).where(df['colC']>0, df['colA'] / df['colB'])
# n = 20000: 2.69 ms ± 23.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# n = 1000000: 86.2 ms ± 441 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.apply(lambda row: row.colA * row.colB if row.colC > 0 else row.colA / row.colB, axis=1)
# n = 20000: 679 ms ± 33.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# n = 1000000: 31.5 s ± 587 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Yet another (easily generalizable) approach, whose corner-stone is pandas.DataFrame.idxmax. First, the easily generalizable preamble.
# Indeed, all your conditions boils down to the following
_gt_1_key = 'two_or_more'
_lt_1_key = 'other'
# The "dictionary-based" if-else statements
labels = {
_gt_1_key : 'Two Or More',
'eri_hispanic': 'Hispanic',
'eri_nat_amer': 'A/I AK Native',
'eri_asian' : 'Asian',
'eri_afr_amer': 'Black/AA',
'eri_hawaiian': 'Haw/Pac Isl.',
'eri_white' : 'White',
_lt_1_key : 'Other',
}
# The output-driving 1-0 matrix
mat = df.filter(regex='^eri_').copy() # `~.copy` to avoid `SettingWithCopyWarning`
... and, finally, in a vectorized fashion:
mat[_gt_1_key] = gt1 = mat.sum(axis=1)
mat[_lt_1_key] = gt1.eq(0).astype(int)
race_label = mat.idxmax(axis=1).map(labels)
where
>>> race_label
0 White
1 Hispanic
2 White
3 White
4 Other
5 White
6 Two Or More
7 White
8 Haw/Pac Isl.
9 White
dtype: object
that is a pandas.Series instance you can easily host within df, i.e. doing df['race_label'] = race_label.
Choosing a method according to the complexity of the criteria
For the examples below - in order to show multiple types of rules for the new column - we will assume a DataFrame with columns 'red', 'green' and 'blue', containing floating-point values ranging 0 to 1.
General case: .apply
As long as the necessary logic to compute the new value can be written as a function of other values in the same row, we can use the .apply method of the DataFrame to get the desired result. Write the function so that it accepts a single parameter, which is a single row of the input:
def as_hex(value):
# clamp to avoid rounding errors etc.
return min(max(0, int(value * 256)), 255)
def hex_color(row):
r, g, b = as_hex(row['red']), as_hex(row['green']), as_hex(row['blue'])
return f'#{r:02x}{g:02x}{b:02x}'
Pass the function itself (don't write parentheses after the name) to .apply, and specify axis=1 (meaning to supply rows to the categorizing function, so as to compute a column - rather than the other way around). Thus:
df['hex_color'] = df.apply(hex_color, axis=1)
Note that wrapping in lambda is not necessary, since we are not binding any arguments or otherwise modifying the function.
The .apply step is necessary because the conversion function itself is not vectorized. Thus, a naive approach like df['color'] = hex_color(df) will not work (example question).
This tool is powerful, but inefficient. For best performance, please use a more specific approach where applicable.
Multiple choices with conditions: numpy.select or repeated assignment with df.loc or df.where
Suppose we were thresholding the color values, and computing rough color names like so:
def additive_color(row):
# Insert here: logic that takes values from the `row` and computes
# the desired cell value for the new column in that row.
# The `row` is an ordinary `Series` object representing a row of the
# original `DataFrame`; it can be indexed with column names, thus:
if row['red'] > 0.5:
if row['green'] > 0.5:
return 'white' if row['blue'] > 0.5 else 'yellow'
else:
return 'magenta' if row['blue'] > 0.5 else 'red'
elif row['green'] > 0.5:
return 'cyan' if row['blue'] > 0.5 else 'green'
else:
return 'blue' if row['blue'] > 0.5 else 'black'
In cases like this - where the categorizing function would be an if/else ladder, or match/case in 3.10 and up - we may get much faster performance using numpy.select.
This approach works very differently. First, compute masks on the data for where each condition applies:
black = (df['red'] <= 0.5) & (df['green'] <= 0.5) & (df['blue'] <= 0.5)
white = (df['red'] > 0.5) & (df['green'] > 0.5) & (df['blue'] > 0.5)
To call numpy.select, we need two parallel sequences - one of the conditions, and another of the corresponding values:
df['color'] = np.select(
[white, black],
['white', 'black'],
'colorful'
)
The optional third argument specifies a value to use when none of the conditions are met. (As an exercise: fill in the remaining conditions, and try it without a third argument.)
A similar approach is to make repeated assignments based on each condition. Assign the default value first, and then use df.loc to assign specific values for each condition:
df['color'] = 'colorful'
df.loc[white, 'color'] = 'white'
df.loc[black, 'color'] = 'black'
Alternately, df.where can be used to do the assignments. However, df.where, used like this, assigns the specified value in places where the condition is not met, so the conditions must be inverted:
df['color'] = 'colorful'
df['color'] = df['color'].where(~white, 'white').where(~black, 'black')
Simple mathematical manipulations: built-in mathematical operators and broadcasting
For example, an apply-based approach like:
def brightness(row):
return row['red'] * .299 + row['green'] * .587 + row['blue'] * .114
df['brightness'] = df.apply(brightness, axis=1)
can instead be written by broadcasting the operators, for much better performance (and is also simpler):
df['brightness'] = df['red'] * .299 + df['green'] * .587 + df['blue'] * .114
As an exercise, here's the first example redone that way:
def as_hex(column):
scaled = (column * 256).astype(int)
clamped = scaled.where(scaled >= 0, 0).where(scaled <= 255, 255)
return clamped.apply(lambda i: f'{i:02x}')
df['hex_color'] = '#' + as_hex(df['red']) + as_hex(df['green']) + as_hex(df['blue'])
I was unable to find a vectorized equivalent to format the integer values as hex strings, so .apply is still used internally here - meaning that the full speed penalty still comes into play. Still, this demonstrates some general techniques.
For more details and examples, see cottontail's answer.
I have a dataframe with one column containing numbers (quantity). Every row represents one day so whole dataframe is should be treated as sequential data. I want to add second column that would calculate cumulative sum of the quantity column but if at any point cumulative sum is greater than 0, next row should start counting cumulative sum from 0.
I solved this problem using iterrows() but I read that this function is very inefficient and having millions of rows, calculation takes over 20 minutes. My solution below:
import pandas as pd
df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1], columns=['quantity'])
for index, row in df.iterrows():
if index == 0:
df.loc[index, 'outcome'] = df.loc[index, 'quantity']
else:
previous_outcome = df.loc[index-1, 'outcome']
if previous_outcome > 0:
previous_outcome = 0
df.loc[index, 'outcome'] = previous_outcome + df.loc[index, 'quantity']
print(df)
# quantity outcome
# -1 -1.0
# -1 -2.0
# -1 -3.0
# -1 -4.0
# 15 11.0 <- since this is greater than 0, next line will start counting from 0
# -1 -1.0
# -1 -2.0
# -1 -3.0
# -1 -4.0
# 5 1.0 <- since this is greater than 0, next line will start counting from 0
# -1 -1.0
# 15 14.0 <- since this is greater than 0, next line will start counting from 0
# -1 -1.0
# -1 -2.0
# -1 -3.0
Is there faster (more optimized way) to calculate this?
I'm also not sure if the "if index == 0" block is the best solution and if this can be solved in more elegant way? Without this block there is an error since in first row there cannot be "previous row" for calculation.
Iterating over DataFrame rows is very slow and should be avoided. Working with chunks of data is the way to go with pandas.
For you case, looking at your DataFrame column quantity as a numpy array, the code below should speed up the process quite a lot compared to your approach:
import pandas as pd
import numpy as np
df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1], columns=['quantity'])
x = np.array(df.quantity)
y = np.zeros(x.size)
total = 0
for i, xi in enumerate(x):
total += xi
y[i] = total
total = total if total < 0 else 0
df['outcome'] = y
print(df)
Out :
quantity outcome
0 -1 -1.0
1 -1 -2.0
2 -1 -3.0
3 -1 -4.0
4 15 11.0
5 -1 -1.0
6 -1 -2.0
7 -1 -3.0
8 -1 -4.0
9 5 1.0
10 -1 -1.0
11 15 14.0
12 -1 -1.0
13 -1 -2.0
14 -1 -3.0
If you still need more speed, suggest to have a look at numba as per jezrael answer.
Edit - Performance test
I got curious about performance and did this module with all 3 approaches.
I haven't optimised the individual functions, just copied the code from OP and jezrael answer with minor changes.
"""
bench_dataframe.py
Performance test of iteration over DataFrame rows.
Methods tested are `DataFrame.iterrows()`, loop over `numpy.array`,
and same using `numba`.
"""
from numba import njit
import pandas as pd
import numpy as np
def pditerrows(df):
"""Iterate over DataFrame using `iterrows`"""
for index, row in df.iterrows():
if index == 0:
df.loc[index, 'outcome'] = df.loc[index, 'quantity']
else:
previous_outcome = df.loc[index-1, 'outcome']
if previous_outcome > 0:
previous_outcome = 0
df.loc[index, 'outcome'] = previous_outcome + df.loc[index, 'quantity']
return df
def nparray(df):
"""Convert DataFrame column to `numpy` arrays."""
x = np.array(df.quantity)
y = np.zeros(x.size)
total = 0
for i, xi in enumerate(x):
total += xi
y[i] = total
total = total if total < 0 else 0
df['outcome'] = y
return df
#njit
def f(x, lim):
result = np.empty(len(x))
result[0] = x[0]
for i, j in enumerate(x[1:], 1):
previous_outcome = result[i-1]
if previous_outcome > lim:
previous_outcome = 0
result[i] = previous_outcome + x[i]
return result
def numbaloop(df):
"""Convert DataFrame to `numpy` arrays and loop using `numba`.
See [https://stackoverflow.com/a/69750009/5069105]
"""
df['outcome'] = f(df.quantity.to_numpy(), 0)
return df
def create_df(size):
"""Create a DataFrame filed with -1's and 15's, with 90% of
the entries equal to -1 and 10% equal to 15, randomly
placed in the array.
"""
df = pd.DataFrame(
np.random.choice(
(-1, 15),
size=size,
p=[0.9, 0.1]
),
columns=['quantity'])
return df
# Make sure all tests lead to the same result
df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1],
columns=['quantity'])
assert nparray(df.copy()).equals(pditerrows(df.copy()))
assert nparray(df.copy()).equals(numbaloop(df.copy()))
Running for a somewhat small array, size = 20_000, leads to:
In: import bench_dataframe as bd
.. df = bd.create_df(size=20_000)
In: %timeit bd.pditerrows(df.copy())
7.06 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In: %timeit bd.nparray(df.copy())
9.76 ms ± 710 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In: %timeit bd.numbaloop(df.copy())
437 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Here numpy arrays were 700+ times faster than iterrows(), and numba was still 22 times faster than numpy.
And for larger arrays, size = 200_000, we get:
In: import bench_dataframe as bd
.. df = bd.create_df(size=200_000)
In: %timeit bd.pditerrows(df.copy())
I gave up and hit Ctrl+C after 10 minutes or so... =P
In: %timeit bd.nparray(df.copy())
86 ms ± 2.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In: %timeit bd.numbaloop(df.copy())
3.15 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Making numba again 25+ times faster than numpy arrays for this example, and confirming that you should avoid at all costs to use iterrows() for anything more than a couple of hundreds of rows.
I think numba is the best when working with loops if performance is important:
#njit
def f(x, lim):
result = np.empty(len(x), dtype=np.int)
result[0] = x[0]
for i, j in enumerate(x[1:], 1):
previous_outcome = result[i-1]
if previous_outcome > lim:
previous_outcome = 0
result[i] = previous_outcome + x[i]
return result
df['outcome1'] = f(df.quantity.to_numpy(), 0)
print(df)
quantity outcome outcome1
0 -1 -1.0 -1
1 -1 -2.0 -2
2 -1 -3.0 -3
3 -1 -4.0 -4
4 15 11.0 11
5 -1 -1.0 -1
6 -1 -2.0 -2
7 -1 -3.0 -3
8 -1 -4.0 -4
9 5 1.0 1
10 -1 -1.0 -1
11 15 14.0 14
12 -1 -1.0 -1
13 -1 -2.0 -2
14 -1 -3.0 -3
I have a dataframe from source data that resembles the following:
In[1]: df = pd.DataFrame({'test_group': [1, 1, 1, 2, 2, 2, 3, 3, 3],
'test_type': [np.nan,'memory', np.nan, np.nan, 'visual', np.nan, np.nan,
'auditory', np.nan]}
Out[1]:
test_group test_type
0 1 NaN
1 1 memory
2 1 NaN
3 2 NaN
4 2 visual
5 2 NaN
6 3 NaN
7 3 auditory
8 3 NaN
test_group represents the grouping of the rows, which represent a test. I need to replace the NaNs in column test_type in each test_group with the value of the row that is not a NaN, e.g. memory, visual, etc.
I've tried a variety of approaches including isolating the "real" value in test_type such as
In [4]: df.groupby('test_group')['test_type'].unique()
Out[4]:
test_group
1 [nan, memory]
2 [nan, visual]
3 [nan, auditory]
Easy enough, I can index into each row and pluck out the value I want. This seems to head in the right direction:
In [6]: df.groupby('test_group')['test_type'].unique().apply(lambda x: x[1])
Out[6]:
test_group
1 memory
2 visual
3 auditory
I tried this among many other things but it doesn't quite work (note: apply and transform give the same result):
In [15]: grp = df.groupby('test_group')
In [16]: df['test_type'] = grp['test_type'].unique().transform(lambda x: x[1])
In [17]: df
Out[17]:
test_group test_type
0 1 NaN
1 1 memory
2 1 visual
3 2 auditory
4 2 NaN
5 2 NaN
6 3 NaN
7 3 NaN
8 3 NaN
I'm sure if I looped it I'd be done with things, but loops are too slow as the data set is millions of records per file.
You can use GroupBy.size to get the size of each group. Then boolean index using Series.isna. Now, use Index.repeat with df.reindex
repeats = df.groupby('test_group').size()
out = df[~df['test_type'].isna()]
out.reindex(out.index.repeat(repeats)).reset_index(drop=True)
test_group test_type
0 1 memory
1 1 memory
2 1 memory
3 2 visual
4 2 visual
5 2 visual
6 3 auditory
7 3 auditory
8 3 auditory
timeit analysis:
Benchmarking dataframe:
df = pd.DataFrame({'test_group': [1]*10_001 + [2]*10_001 + [3]*10_001,
'test_type' : [np.nan]*10_000 + ['memory'] +
[np.nan]*10_000 + ['visual'] +
[np.nan]*10_000 + ['auditory']})
df.shape
# (30003, 2)
Results:
# Ch3steR's answer
In [54]: %%timeit
...: repeats = df.groupby('test_group').size()
...: out = df[~df['test_type'].isna()]
...: out.reindex(out.index.repeat(repeats)).reset_index(drop=True)
...:
...:
2.56 ms ± 73.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# timgeb's answer
In [55]: %%timeit
...: df['test_type'] = df.groupby('test_group')['test_type'].fillna(method='ffill').fillna(method='bfill')
...:
...:
10.1 ms ± 724 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Almost ~4X faster. I believe it's because boolean indexing is very fast. And reindex + repeat is lightwieght compared to dual fillna.
Under the assumption that there's a unique non-nan value per group, the following should satisfy your request.
>>> df['test_type'] = df.groupby('test_group')['test_type'].ffill().bfill()
>>> df
test_group test_type
0 1 memory
1 1 memory
2 1 memory
3 2 visual
4 2 visual
5 2 visual
6 3 auditory
7 3 auditory
8 3 auditory
edit:
The original answer used
df.groupby('test_group')['test_type'].fillna(method='ffill').fillna(method='bfill')
but it looks like according to schwim's timings ffill/bfill is significantly faster (for some reason).
I'm starting to belive that pandas dataframes are much less intuitive to handle than Excel, but I'm not giving up yet!
So, I'm JUST trying to check data in the same column but in (various) previous rows using the .shift() method. I'm using the following DF as an example since the original is too complicated to copy into here, but the principle is the same.
counter = list(range(20))
df1 = pd.DataFrame(counter, columns=["Counter"])
df1["Event"] = [True, False, False, False, False, False, True, False,False,False,False,False,False,False,False,False,False,False,False,True]
I'm trying to create sums of the column counter, but only under the following conditions:
If the "Event" = True I want to sum the "Counter" values for the last 10 previous rows before the event happened.
EXCEPT if there is another Event within those 10 previous rows. In this case I only want to sum up the counter values between those two events (without exceeding 10 rows).
To clarify my goal this is the result I had in mind:
My attempt so far looks like this:
for index, row in df1.iterrows():
if row["Event"] == True:
counter = 1
summ = 0
while counter < 10 and row["Event"].shift(counter) == False:
summ += row["Counter"].shift(counter)
counter += 1
else:
df1.at[index, "Sum"] = summ
I'm trying to first find Event == True and from there start iterating backwards with a counter and summing up the counters as I go. However it seems to have a problem with shift:
AttributeError: 'bool' object has no attribute 'shift'
Please shatter my believes and show me, that Excel isn't actually superior.
We need create a subgroup key with cumsum , then do rolling sum
n=10
s=df1.Counter.groupby(df1.Event.iloc[::-1].cumsum()).\
rolling(n+1,min_periods=1).sum().\
reset_index(level=0,drop=True).where(df1.Event)
df1['sum']=(s-df1.Counter).fillna(0)
df1
Counter Event sum
0 0 True 0.0
1 1 False 0.0
2 2 False 0.0
3 3 False 0.0
4 4 False 0.0
5 5 False 0.0
6 6 True 15.0
7 7 False 0.0
8 8 False 0.0
9 9 False 0.0
10 10 False 0.0
11 11 False 0.0
12 12 False 0.0
13 13 False 0.0
14 14 False 0.0
15 15 False 0.0
16 16 False 0.0
17 17 False 0.0
18 18 False 0.0
19 19 True 135.0
Element-wise approach
You definitely can approach a task in pandas the way you would in excel. Your approach needs to be tweaked a bit because pandas.Series.shift operates on whole arrays or Series, not on a single value - you can't use it just to move back up the dataframe relative to a value.
The following loops through the indices of your dataframe, walking back up (up to) 10 spots for each Event:
def create_sum_column_loop(df):
'''
Adds a Sum column with the rolling sum of 10 Counters prior to an Event
'''
df["Sum"] = 0
for index in range(df.shape[0]):
counter = 1
summ = 0
if df.loc[index, "Event"]: # == True is implied
for backup in range(1, 11):
# handle case where index - backup is before
# the start of the dataframe
if index - backup < 0:
break
# stop counting when we hit another event
if df.loc[index - backup, "Event"]:
break
# increment by the counter
summ += df.loc[index - backup, "Counter"]
df.loc[index, "Sum"] = summ
return df
This does the job:
In [15]: df1_sum1 = create_sum_column(df1.copy()) # copy to preserve original
In [16]: df1_sum1
Counter Event Sum
0 0 True 0
1 1 False 0
2 2 False 0
3 3 False 0
4 4 False 0
5 5 False 0
6 6 True 15
7 7 False 0
8 8 False 0
9 9 False 0
10 10 False 0
11 11 False 0
12 12 False 0
13 13 False 0
14 14 False 0
15 15 False 0
16 16 False 0
17 17 False 0
18 18 False 0
19 19 True 135
Better: vectorized operations
However, the power of pandas comes in its vectorized operations. Python is an interpreted, dynamically-typed language, meaning it's flexible, user friendly (easy to read/write/learn), and slow. To combat this, many commonly-used workflows, including many pandas.Series operations, are written in optimized, compiled code from other languages like C, C++, and Fortran. Under the hood, they're doing the same thing... df1.Counter.cumsum() does loop through the elements and create a running total, but it does it in C, making it lightning fast.
This is what makes learning a framework like pandas difficult - you need to relearn how to do math using that framework. For pandas, the entire game is learning how to use pandas and numpy built-in operators to do your work.
Borrowing the clever solution from #YOBEN_S:
def create_sum_column_vectorized(df):
n = 10
s = (
df.Counter
# group by a unique identifier for each event. This is a
# particularly clever bit, where #YOBEN_S reverses
# the order of df.Event, then computes a running total
.groupby(df.Event.iloc[::-1].cumsum())
# compute the rolling sum within each group
.rolling(n+1,min_periods=1).sum()
# drop the group index so we can align with the original DataFrame
.reset_index(level=0,drop=True)
# drop all non-event observations
.where(df.Event)
)
# remove the counter value for the actual event
# rows, then fill the remaining rows with 0s
df['sum'] = (s - df.Counter).fillna(0)
return df
We can see that the result is the same as the one above (though the values are suddenly floats):
In [23]: df1_sum2 = create_sum_column_vectorized(df1) # copy to preserve original
In [24]: df1_sum2
The difference comes in the performance. In ipython or jupyter we can use the %timeit command to see how long a statement takes to run:
In [25]: %timeit create_sum_column_loop(df1.copy())
3.21 ms ± 54.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [26]: %timeit create_sum_column_vectorized(df1.copy())
7.76 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
For small datasets, like the one in your example, the difference will be negligible or will even slightly favor the pure python loop.
For much larger datasets, the difference becomes apparent. Let's create a dataset similar to your example, but with 100,000 rows:
In [27]: df_big = pd.DataFrame({
...: 'Counter': np.arange(100000),
...: 'Event': np.random.random(size=100000) > 0.9,
...: })
...:
Now, you can really see the performance benefit of the vectorized approach:
In [28]: %timeit create_sum_column_loop(df_big.copy())
13 s ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [29]: %timeit create_sum_column_vectorized(df_big.copy())
5.81 s ± 28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The vectorized version takes less than half the time. This difference will continue to widen as the amount of data increases.
Compiling your own workflows with numba
Note that for specific operations, it is possible to speed up operations further by pre-compiling the code yourself. In this case, the looped version can be compiled with numba:
import numba
#numba.jit(nopython=True)
def _inner_vectorized_loop(counter, event, sum_col):
for index in range(len(counter)):
summ = 0
if event[index]:
for backup in range(1, 11):
# handle case where index - backup is before
# the start of the dataframe
if index - backup < 0:
break
# stop counting when we hit another event
if event[index - backup]:
break
# increment by the counter
summ = summ + counter[index - backup]
sum_col[index] = summ
return sum_col
def create_sum_column_loop_jit(df):
'''
Adds a Sum column with the rolling sum of 10 Counters prior to an Event
'''
df["Sum"] = 0
df["Sum"] = _inner_vectorized_loop(
df.Counter.values, df.Event.values, df.Sum.values)
return df
This beats both pandas and the for loop by a factor of more than 1000!
In [90]: %timeit create_sum_column_loop_jit(df_big.copy())
1.62 ms ± 53.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Balancing readability, efficiency, and flexibility is the constant challenge. Best of luck as you dive in!