This question already has answers here:
vectorize conditional assignment in pandas dataframe
(5 answers)
Closed 19 days ago.
I have a dataframe containing some input values, and I trying to evaluate a parameter conditioned on the values given in a column (example below).
What I would like to obtain is shown in the figure:
How can I solve the issue below?
import numpy as np
import pandas as pd
df = pd.DataFrame.from_dict({
'x': [0,1,2,3,4],
'y': [100,100,100,100,100],
'z': [100,100,100,100,100],
})
def evaluate(input):
if input <=2:
a=4
b=6
else:
a=7
b=8
return df['x']*a+b*(df['y']+df['z'])
df['calc'] = evaluate(df['x'])
> ---------------------------------------------------------------------------
> ValueError Traceback (most recent call last)
> ~\AppData\Local\Temp/ipykernel_38760/3329748611.py in <module>
> 15 b=8
> 16 return df['x']*a+b*(df['y']+df['z'])
> 17 df['calc'] = evaluate(df['x'])
>
> ~\AppData\Local\Temp/ipykernel_38760/3329748611.py in evaluate(input)
> 8 })
> 9 def evaluate(input):
> 10 if input <=2:
> 11 a=4
> 12 b=6
>
> ~\anaconda3\lib\site-packages\pandas\core\generic.py in __nonzero__(self)
> 1535 #final
> 1536 def __nonzero__(self):
> 1537 raise ValueError(
> 1538 f"The truth value of a {type(self).__name__} is ambiguous. "
> 1539 "Use a.empty, a.bool(), a.item(), a.any() or a.all()."
>
> ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
The problem is your function is not vectorizing properly over the array df['x'], specifically your if statement if input <=2: cannot be applied to the column df['x'] generally. The better way to do this is with a loop, or df.apply(). Since df.apply() is trickier with multiple columns involved, the easiest solution is probably just create a function with a simple for loop using df.itertuples().
def evaluate(df):
calc_column = []
for row in df.itertuples():
if row.x <= 2:
a = 4
b = 6
else:
a = 7
b = 8
calc_column.append(row.x*a + b*(row.y + row.z))
return calc_column
df['calc'] = evaluate(df)
df
x y z calc
0 0 100 100 1200
1 1 100 100 1204
2 2 100 100 1208
3 3 100 100 1621
4 4 100 100 1628
Here each row element can be accessed from the df.itertuples() method, allowing you to easily grab the values you need from each row.
Use your conditions to create a mask. Use the mask to filter the rows you want to operate on. Return a Series.
def evaluate(df):
S = pd.Series(0,index=df.index)
mask = df['x'] <= 2
a,b = 4,6
S[mask] = df.loc[mask,'x']*a + b*(df.loc[mask,['y','z']].sum(1))
mask = df['x'] > 2
a,b = 7,8
S[mask] = df.loc[mask,'x']*a + b*(df.loc[mask,['y','z']].sum(1))
return S
df['calc'] = evaluate(df)
Or just...
def evaluate(df):
S = pd.Series(0,index=df.index)
mask = df['x'] <= 2
a,b = 4,6
S[mask] = df.loc[mask,'x']*a + b*(df.loc[mask,['y','z']].sum(1))
# mask = df['x'] > 2
a,b = 7,8
S[~mask] = df.loc[~mask,'x']*a + b*(df.loc[~mask,['y','z']].sum(1))
return S
I want to apply my custom function (it uses an if-else ladder) to these six columns (ERI_Hispanic, ERI_AmerInd_AKNatv, ERI_Asian, ERI_Black_Afr.Amer, ERI_HI_PacIsl, ERI_White) in each row of my dataframe.
I've tried different methods from other questions but still can't seem to find the right answer for my problem. The critical piece of this is that if the person is counted as Hispanic they can't be counted as anything else. Even if they have a "1" in another ethnicity column they still are counted as Hispanic not two or more races. Similarly, if the sum of all the ERI columns is greater than 1 they are counted as two or more races and can't be counted as a unique ethnicity(except for Hispanic).
It's almost like doing a for loop through each row and if each record meets a criterion they are added to one list and eliminated from the original.
From the dataframe below I need to calculate a new column based on the following spec in SQL:
CRITERIA
IF [ERI_Hispanic] = 1 THEN RETURN “Hispanic”
ELSE IF SUM([ERI_AmerInd_AKNatv] + [ERI_Asian] + [ERI_Black_Afr.Amer] + [ERI_HI_PacIsl] + [ERI_White]) > 1 THEN RETURN “Two or More”
ELSE IF [ERI_AmerInd_AKNatv] = 1 THEN RETURN “A/I AK Native”
ELSE IF [ERI_Asian] = 1 THEN RETURN “Asian”
ELSE IF [ERI_Black_Afr.Amer] = 1 THEN RETURN “Black/AA”
ELSE IF [ERI_HI_PacIsl] = 1 THEN RETURN “Haw/Pac Isl.”
ELSE IF [ERI_White] = 1 THEN RETURN “White”
Comment: If the ERI Flag for Hispanic is True (1), the employee is classified as “Hispanic”
Comment: If more than 1 non-Hispanic ERI Flag is true, return “Two or More”
DATAFRAME
lname fname rno_cd eri_afr_amer eri_asian eri_hawaiian eri_hispanic eri_nat_amer eri_white rno_defined
0 MOST JEFF E 0 0 0 0 0 1 White
1 CRUISE TOM E 0 0 0 1 0 0 White
2 DEPP JOHNNY 0 0 0 0 0 1 Unknown
3 DICAP LEO 0 0 0 0 0 1 Unknown
4 BRANDO MARLON E 0 0 0 0 0 0 White
5 HANKS TOM 0 0 0 0 0 1 Unknown
6 DENIRO ROBERT E 0 1 0 0 0 1 White
7 PACINO AL E 0 0 0 0 0 1 White
8 WILLIAMS ROBIN E 0 0 1 0 0 0 White
9 EASTWOOD CLINT E 0 0 0 0 0 1 White
OK, two steps to this - first is to write a function that does the translation you want - I've put an example together based on your pseudo-code:
def label_race (row):
if row['eri_hispanic'] == 1 :
return 'Hispanic'
if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
return 'Two Or More'
if row['eri_nat_amer'] == 1 :
return 'A/I AK Native'
if row['eri_asian'] == 1:
return 'Asian'
if row['eri_afr_amer'] == 1:
return 'Black/AA'
if row['eri_hawaiian'] == 1:
return 'Haw/Pac Isl.'
if row['eri_white'] == 1:
return 'White'
return 'Other'
You may want to go over this, but it seems to do the trick - notice that the parameter going into the function is considered to be a Series object labelled "row".
Next, use the apply function in pandas to apply the function - e.g.
df.apply (lambda row: label_race(row), axis=1)
Note the axis=1 specifier, that means that the application is done at a row, rather than a column level. The results are here:
0 White
1 Hispanic
2 White
3 White
4 Other
5 White
6 Two Or More
7 White
8 Haw/Pac Isl.
9 White
If you're happy with those results, then run it again, saving the results into a new column in your original dataframe.
df['race_label'] = df.apply (lambda row: label_race(row), axis=1)
The resultant dataframe looks like this (scroll to the right to see the new column):
lname fname rno_cd eri_afr_amer eri_asian eri_hawaiian eri_hispanic eri_nat_amer eri_white rno_defined race_label
0 MOST JEFF E 0 0 0 0 0 1 White White
1 CRUISE TOM E 0 0 0 1 0 0 White Hispanic
2 DEPP JOHNNY NaN 0 0 0 0 0 1 Unknown White
3 DICAP LEO NaN 0 0 0 0 0 1 Unknown White
4 BRANDO MARLON E 0 0 0 0 0 0 White Other
5 HANKS TOM NaN 0 0 0 0 0 1 Unknown White
6 DENIRO ROBERT E 0 1 0 0 0 1 White Two Or More
7 PACINO AL E 0 0 0 0 0 1 White White
8 WILLIAMS ROBIN E 0 0 1 0 0 0 White Haw/Pac Isl.
9 EASTWOOD CLINT E 0 0 0 0 0 1 White White
Since this is the first Google result for 'pandas new column from others', here's a simple example:
import pandas as pd
# make a simple dataframe
df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
df
# a b
# 0 1 3
# 1 2 4
# create an unattached column with an index
df.apply(lambda row: row.a + row.b, axis=1)
# 0 4
# 1 6
# do same but attach it to the dataframe
df['c'] = df.apply(lambda row: row.a + row.b, axis=1)
df
# a b c
# 0 1 3 4
# 1 2 4 6
If you get the SettingWithCopyWarning you can do it this way also:
fn = lambda row: row.a + row.b # define a function for the new column
col = df.apply(fn, axis=1) # get column data with an index
df = df.assign(c=col.values) # assign values to column 'c'
Source: https://stackoverflow.com/a/12555510/243392
And if your column name includes spaces you can use syntax like this:
df = df.assign(**{'some column name': col.values})
And here's the documentation for apply, and assign.
The answers above are perfectly valid, but a vectorized solution exists, in the form of numpy.select. This allows you to define conditions, then define outputs for those conditions, much more efficiently than using apply:
First, define conditions:
conditions = [
df['eri_hispanic'] == 1,
df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
df['eri_nat_amer'] == 1,
df['eri_asian'] == 1,
df['eri_afr_amer'] == 1,
df['eri_hawaiian'] == 1,
df['eri_white'] == 1,
]
Now, define the corresponding outputs:
outputs = [
'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
]
Finally, using numpy.select:
res = np.select(conditions, outputs, 'Other')
pd.Series(res)
0 White
1 Hispanic
2 White
3 White
4 Other
5 White
6 Two Or More
7 White
8 Haw/Pac Isl.
9 White
dtype: object
Why should numpy.select be used over apply? Here are some performance checks:
df = pd.concat([df]*1000)
In [42]: %timeit df.apply(lambda row: label_race(row), axis=1)
1.07 s ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [44]: %%timeit
...: conditions = [
...: df['eri_hispanic'] == 1,
...: df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
...: df['eri_nat_amer'] == 1,
...: df['eri_asian'] == 1,
...: df['eri_afr_amer'] == 1,
...: df['eri_hawaiian'] == 1,
...: df['eri_white'] == 1,
...: ]
...:
...: outputs = [
...: 'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
...: ]
...:
...: np.select(conditions, outputs, 'Other')
...:
...:
3.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using numpy.select gives us vastly improved performance, and the discrepancy will only increase as the data grows.
.apply() takes in a function as the first parameter; pass in the label_race function as so:
df['race_label'] = df.apply(label_race, axis=1)
You don't need to make a lambda function to pass in a function.
try this,
df.loc[df['eri_white']==1,'race_label'] = 'White'
df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
df['race_label'].fillna('Other', inplace=True)
O/P:
lname fname rno_cd eri_afr_amer eri_asian eri_hawaiian \
0 MOST JEFF E 0 0 0
1 CRUISE TOM E 0 0 0
2 DEPP JOHNNY NaN 0 0 0
3 DICAP LEO NaN 0 0 0
4 BRANDO MARLON E 0 0 0
5 HANKS TOM NaN 0 0 0
6 DENIRO ROBERT E 0 1 0
7 PACINO AL E 0 0 0
8 WILLIAMS ROBIN E 0 0 1
9 EASTWOOD CLINT E 0 0 0
eri_hispanic eri_nat_amer eri_white rno_defined race_label
0 0 0 1 White White
1 1 0 0 White Hispanic
2 0 0 1 Unknown White
3 0 0 1 Unknown White
4 0 0 0 White Other
5 0 0 1 Unknown White
6 0 0 1 White Two Or More
7 0 0 1 White White
8 0 0 0 White Haw/Pac Isl.
9 0 0 1 White White
use .loc instead of apply.
it improves vectorization.
.loc works in simple manner, mask rows based on the condition, apply values to the freeze rows.
for more details visit, .loc docs
Performance metrics:
Accepted Answer:
def label_race (row):
if row['eri_hispanic'] == 1 :
return 'Hispanic'
if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
return 'Two Or More'
if row['eri_nat_amer'] == 1 :
return 'A/I AK Native'
if row['eri_asian'] == 1:
return 'Asian'
if row['eri_afr_amer'] == 1:
return 'Black/AA'
if row['eri_hawaiian'] == 1:
return 'Haw/Pac Isl.'
if row['eri_white'] == 1:
return 'White'
return 'Other'
df=pd.read_csv('dataser.csv')
df = pd.concat([df]*1000)
%timeit df.apply(lambda row: label_race(row), axis=1)
1.15 s ± 46.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
My Proposed Answer:
def label_race(df):
df.loc[df['eri_white']==1,'race_label'] = 'White'
df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
df['race_label'].fillna('Other', inplace=True)
df=pd.read_csv('s22.csv')
df = pd.concat([df]*1000)
%timeit label_race(df)
24.7 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If we inspect its source code, apply() is a syntactic sugar for a Python for-loop (via the apply_series_generator() method of the FrameApply class). Because it has the pandas overhead, it's generally slower than a Python loop.
Use optimized (vectorized) methods wherever possible. If you have to use a loop, use #numba.jit decorator.
1. Don't use apply() for an if-else ladder
df.apply() is just about the slowest way to do this in pandas. As shown in the answers of user3483203 and Mohamed Thasin ah, depending on the dataframe size, np.select() and df.loc may be 50-300 times faster than df.apply() to produce the same output.
As it happens, a loop implementation (not unlike apply()) with the #jit decorator from numba module is (about 50-60%) faster than df.loc and np.select.1
Numba works on numpy arrays, so before using the jit decorator, you need to convert the dataframe into a numpy array. Then fill in values in a pre-initialized empty array by checking the conditions in a loop. Since numpy arrays don't have column names, you have to access the columns by their index in the loop. The most inconvenient part of the if-else ladder in the jitted function over the one in apply() is accessing the columns by their indices. Otherwise it's almost the same implementation.
import numpy as np
import numba as nb
#nb.jit(nopython=True)
def conditional_assignment(arr, res):
length = len(arr)
for i in range(length):
if arr[i][3] == 1:
res[i] = 'Hispanic'
elif arr[i][0] + arr[i][1] + arr[i][2] + arr[i][4] + arr[i][5] > 1:
res[i] = 'Two Or More'
elif arr[i][0] == 1:
res[i] = 'Black/AA'
elif arr[i][1] == 1:
res[i] = 'Asian'
elif arr[i][2] == 1:
res[i] = 'Haw/Pac Isl.'
elif arr[i][4] == 1:
res[i] = 'A/I AK Native'
elif arr[i][5] == 1:
res[i] = 'White'
else:
res[i] = 'Other'
return res
# the columns with the boolean data
cols = [c for c in df.columns if c.startswith('eri_')]
# initialize an empty array to be filled in a loop
# for string dtype arrays, we need to know the length of the longest string
# and use it to set the dtype
res = np.empty(len(df), dtype=f"<U{len('A/I AK Native')}")
# pass the underlying numpy array of `df[cols]` into the jitted function
df['rno_defined'] = conditional_assignment(df[cols].values, res)
2. Don't use apply() for numeric operations
If you need to add a new row by adding two columns, your first instinct may be to write
df['c'] = df.apply(lambda row: row['a'] + row['b'], axis=1)
But instead of this, row-wise add using sum(axis=1) method (or + operator if there are only a couple of columns):
df['c'] = df[['a','b']].sum(axis=1)
# equivalently
df['c'] = df['a'] + df['b']
Depending on the dataframe size, sum(1) may be 100s of times faster than apply().
In fact, you will almost never need apply() for numeric operations on a pandas dataframe because it has optimized methods for most operations: addition (sum(1)), subtraction (sub() or diff()), multiplication (prod(1)), division (div() or /), power (pow()), >, >=, ==, %, //, &, | etc. can all be performed on the entire dataframe without apply().
For example, let's say you want to create a new column using the following rule:
IF [colC] > 0 THEN RETURN [colA] * [colB]
ELSE RETURN [colA] / [colB]
Using the optimized pandas methods, this can be written as
df['new'] = df[['colA','colB']].prod(1).where(df['colC']>0, df['colA'] / df['colB'])
the equivalent apply() solution is:
df['new'] = df.apply(lambda row: row.colA * row.colB if row.colC > 0 else row.colA / row.colB, axis=1)
The approach using the optimized methods is 250 times faster than the equivalent apply() approach for dataframes with 20k rows. This gap only increases as the data size increases (for a dataframe with 1 mil rows, it's 365 times faster) and the time difference will become more and more noticeable.2
1: In the below result, I show the performance of the three approaches using a dataframe with 24 mil rows (this is the largest frame I can construct on my machine). For smaller frames, the numba-jitted function consistently runs at least 50% faster than the other two as well (you can check yourself).
def pd_loc(df):
df['rno_defined'] = 'Other'
df.loc[df['eri_nat_amer'] == 1, 'rno_defined'] = 'A/I AK Native'
df.loc[df['eri_asian'] == 1, 'rno_defined'] = 'Asian'
df.loc[df['eri_afr_amer'] == 1, 'rno_defined'] = 'Black/AA'
df.loc[df['eri_hawaiian'] == 1, 'rno_defined'] = 'Haw/Pac Isl.'
df.loc[df['eri_white'] == 1, 'rno_defined'] = 'White'
df.loc[df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1) > 1, 'rno_defined'] = 'Two Or More'
df.loc[df['eri_hispanic'] == 1, 'rno_defined'] = 'Hispanic'
return df
def np_select(df):
conditions = [df['eri_hispanic'] == 1,
df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
df['eri_nat_amer'] == 1,
df['eri_asian'] == 1,
df['eri_afr_amer'] == 1,
df['eri_hawaiian'] == 1,
df['eri_white'] == 1]
outputs = ['Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White']
df['rno_defined'] = np.select(conditions, outputs, 'Other')
return df
#nb.jit(nopython=True)
def conditional_assignment(arr, res):
length = len(arr)
for i in range(length):
if arr[i][3] == 1 :
res[i] = 'Hispanic'
elif arr[i][0] + arr[i][1] + arr[i][2] + arr[i][4] + arr[i][5] > 1 :
res[i] = 'Two Or More'
elif arr[i][0] == 1:
res[i] = 'Black/AA'
elif arr[i][1] == 1:
res[i] = 'Asian'
elif arr[i][2] == 1:
res[i] = 'Haw/Pac Isl.'
elif arr[i][4] == 1 :
res[i] = 'A/I AK Native'
elif arr[i][5] == 1:
res[i] = 'White'
else:
res[i] = 'Other'
return res
def nb_loop(df):
cols = [c for c in df.columns if c.startswith('eri_')]
res = np.empty(len(df), dtype=f"<U{len('A/I AK Native')}")
df['rno_defined'] = conditional_assignment(df[cols].values, res)
return df
# df with 24mil rows
n = 4_000_000
df = pd.DataFrame({
'eri_afr_amer': [0, 0, 0, 0, 0, 0]*n,
'eri_asian': [1, 0, 0, 0, 0, 0]*n,
'eri_hawaiian': [0, 0, 0, 1, 0, 0]*n,
'eri_hispanic': [0, 1, 0, 0, 1, 0]*n,
'eri_nat_amer': [0, 0, 0, 0, 1, 0]*n,
'eri_white': [0, 0, 1, 1, 0, 0]*n
}, dtype='int8')
df.insert(0, 'name', ['MOST', 'CRUISE', 'DEPP', 'DICAP', 'BRANDO', 'HANKS']*n)
%timeit nb_loop(df)
# 5.23 s ± 45.2 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
%timeit pd_loc(df)
# 7.97 s ± 28.8 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
%timeit np_select(df)
# 8.5 s ± 39.6 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
2: In the below result, I show the performance of the two approaches using a dataframe with 20k rows and again with 1 mil rows. For smaller frames, the gap is smaller because the optimized approach has an overhead while apply() is a loop. As the size of the frame increases, the vectorization overhead cost diminishes w.r.t. to the overall runtime of the code while apply() remains a loop over the frame.
n = 20_000 # 1_000_000
df = pd.DataFrame(np.random.rand(n,3)-0.5, columns=['colA','colB','colC'])
%timeit df[['colA','colB']].prod(1).where(df['colC']>0, df['colA'] / df['colB'])
# n = 20000: 2.69 ms ± 23.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# n = 1000000: 86.2 ms ± 441 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.apply(lambda row: row.colA * row.colB if row.colC > 0 else row.colA / row.colB, axis=1)
# n = 20000: 679 ms ± 33.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# n = 1000000: 31.5 s ± 587 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Yet another (easily generalizable) approach, whose corner-stone is pandas.DataFrame.idxmax. First, the easily generalizable preamble.
# Indeed, all your conditions boils down to the following
_gt_1_key = 'two_or_more'
_lt_1_key = 'other'
# The "dictionary-based" if-else statements
labels = {
_gt_1_key : 'Two Or More',
'eri_hispanic': 'Hispanic',
'eri_nat_amer': 'A/I AK Native',
'eri_asian' : 'Asian',
'eri_afr_amer': 'Black/AA',
'eri_hawaiian': 'Haw/Pac Isl.',
'eri_white' : 'White',
_lt_1_key : 'Other',
}
# The output-driving 1-0 matrix
mat = df.filter(regex='^eri_').copy() # `~.copy` to avoid `SettingWithCopyWarning`
... and, finally, in a vectorized fashion:
mat[_gt_1_key] = gt1 = mat.sum(axis=1)
mat[_lt_1_key] = gt1.eq(0).astype(int)
race_label = mat.idxmax(axis=1).map(labels)
where
>>> race_label
0 White
1 Hispanic
2 White
3 White
4 Other
5 White
6 Two Or More
7 White
8 Haw/Pac Isl.
9 White
dtype: object
that is a pandas.Series instance you can easily host within df, i.e. doing df['race_label'] = race_label.
Choosing a method according to the complexity of the criteria
For the examples below - in order to show multiple types of rules for the new column - we will assume a DataFrame with columns 'red', 'green' and 'blue', containing floating-point values ranging 0 to 1.
General case: .apply
As long as the necessary logic to compute the new value can be written as a function of other values in the same row, we can use the .apply method of the DataFrame to get the desired result. Write the function so that it accepts a single parameter, which is a single row of the input:
def as_hex(value):
# clamp to avoid rounding errors etc.
return min(max(0, int(value * 256)), 255)
def hex_color(row):
r, g, b = as_hex(row['red']), as_hex(row['green']), as_hex(row['blue'])
return f'#{r:02x}{g:02x}{b:02x}'
Pass the function itself (don't write parentheses after the name) to .apply, and specify axis=1 (meaning to supply rows to the categorizing function, so as to compute a column - rather than the other way around). Thus:
df['hex_color'] = df.apply(hex_color, axis=1)
Note that wrapping in lambda is not necessary, since we are not binding any arguments or otherwise modifying the function.
The .apply step is necessary because the conversion function itself is not vectorized. Thus, a naive approach like df['color'] = hex_color(df) will not work (example question).
This tool is powerful, but inefficient. For best performance, please use a more specific approach where applicable.
Multiple choices with conditions: numpy.select or repeated assignment with df.loc or df.where
Suppose we were thresholding the color values, and computing rough color names like so:
def additive_color(row):
# Insert here: logic that takes values from the `row` and computes
# the desired cell value for the new column in that row.
# The `row` is an ordinary `Series` object representing a row of the
# original `DataFrame`; it can be indexed with column names, thus:
if row['red'] > 0.5:
if row['green'] > 0.5:
return 'white' if row['blue'] > 0.5 else 'yellow'
else:
return 'magenta' if row['blue'] > 0.5 else 'red'
elif row['green'] > 0.5:
return 'cyan' if row['blue'] > 0.5 else 'green'
else:
return 'blue' if row['blue'] > 0.5 else 'black'
In cases like this - where the categorizing function would be an if/else ladder, or match/case in 3.10 and up - we may get much faster performance using numpy.select.
This approach works very differently. First, compute masks on the data for where each condition applies:
black = (df['red'] <= 0.5) & (df['green'] <= 0.5) & (df['blue'] <= 0.5)
white = (df['red'] > 0.5) & (df['green'] > 0.5) & (df['blue'] > 0.5)
To call numpy.select, we need two parallel sequences - one of the conditions, and another of the corresponding values:
df['color'] = np.select(
[white, black],
['white', 'black'],
'colorful'
)
The optional third argument specifies a value to use when none of the conditions are met. (As an exercise: fill in the remaining conditions, and try it without a third argument.)
A similar approach is to make repeated assignments based on each condition. Assign the default value first, and then use df.loc to assign specific values for each condition:
df['color'] = 'colorful'
df.loc[white, 'color'] = 'white'
df.loc[black, 'color'] = 'black'
Alternately, df.where can be used to do the assignments. However, df.where, used like this, assigns the specified value in places where the condition is not met, so the conditions must be inverted:
df['color'] = 'colorful'
df['color'] = df['color'].where(~white, 'white').where(~black, 'black')
Simple mathematical manipulations: built-in mathematical operators and broadcasting
For example, an apply-based approach like:
def brightness(row):
return row['red'] * .299 + row['green'] * .587 + row['blue'] * .114
df['brightness'] = df.apply(brightness, axis=1)
can instead be written by broadcasting the operators, for much better performance (and is also simpler):
df['brightness'] = df['red'] * .299 + df['green'] * .587 + df['blue'] * .114
As an exercise, here's the first example redone that way:
def as_hex(column):
scaled = (column * 256).astype(int)
clamped = scaled.where(scaled >= 0, 0).where(scaled <= 255, 255)
return clamped.apply(lambda i: f'{i:02x}')
df['hex_color'] = '#' + as_hex(df['red']) + as_hex(df['green']) + as_hex(df['blue'])
I was unable to find a vectorized equivalent to format the integer values as hex strings, so .apply is still used internally here - meaning that the full speed penalty still comes into play. Still, this demonstrates some general techniques.
For more details and examples, see cottontail's answer.