Related
I want to apply my custom function (it uses an if-else ladder) to these six columns (ERI_Hispanic, ERI_AmerInd_AKNatv, ERI_Asian, ERI_Black_Afr.Amer, ERI_HI_PacIsl, ERI_White) in each row of my dataframe.
I've tried different methods from other questions but still can't seem to find the right answer for my problem. The critical piece of this is that if the person is counted as Hispanic they can't be counted as anything else. Even if they have a "1" in another ethnicity column they still are counted as Hispanic not two or more races. Similarly, if the sum of all the ERI columns is greater than 1 they are counted as two or more races and can't be counted as a unique ethnicity(except for Hispanic).
It's almost like doing a for loop through each row and if each record meets a criterion they are added to one list and eliminated from the original.
From the dataframe below I need to calculate a new column based on the following spec in SQL:
CRITERIA
IF [ERI_Hispanic] = 1 THEN RETURN “Hispanic”
ELSE IF SUM([ERI_AmerInd_AKNatv] + [ERI_Asian] + [ERI_Black_Afr.Amer] + [ERI_HI_PacIsl] + [ERI_White]) > 1 THEN RETURN “Two or More”
ELSE IF [ERI_AmerInd_AKNatv] = 1 THEN RETURN “A/I AK Native”
ELSE IF [ERI_Asian] = 1 THEN RETURN “Asian”
ELSE IF [ERI_Black_Afr.Amer] = 1 THEN RETURN “Black/AA”
ELSE IF [ERI_HI_PacIsl] = 1 THEN RETURN “Haw/Pac Isl.”
ELSE IF [ERI_White] = 1 THEN RETURN “White”
Comment: If the ERI Flag for Hispanic is True (1), the employee is classified as “Hispanic”
Comment: If more than 1 non-Hispanic ERI Flag is true, return “Two or More”
DATAFRAME
lname fname rno_cd eri_afr_amer eri_asian eri_hawaiian eri_hispanic eri_nat_amer eri_white rno_defined
0 MOST JEFF E 0 0 0 0 0 1 White
1 CRUISE TOM E 0 0 0 1 0 0 White
2 DEPP JOHNNY 0 0 0 0 0 1 Unknown
3 DICAP LEO 0 0 0 0 0 1 Unknown
4 BRANDO MARLON E 0 0 0 0 0 0 White
5 HANKS TOM 0 0 0 0 0 1 Unknown
6 DENIRO ROBERT E 0 1 0 0 0 1 White
7 PACINO AL E 0 0 0 0 0 1 White
8 WILLIAMS ROBIN E 0 0 1 0 0 0 White
9 EASTWOOD CLINT E 0 0 0 0 0 1 White
OK, two steps to this - first is to write a function that does the translation you want - I've put an example together based on your pseudo-code:
def label_race (row):
if row['eri_hispanic'] == 1 :
return 'Hispanic'
if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
return 'Two Or More'
if row['eri_nat_amer'] == 1 :
return 'A/I AK Native'
if row['eri_asian'] == 1:
return 'Asian'
if row['eri_afr_amer'] == 1:
return 'Black/AA'
if row['eri_hawaiian'] == 1:
return 'Haw/Pac Isl.'
if row['eri_white'] == 1:
return 'White'
return 'Other'
You may want to go over this, but it seems to do the trick - notice that the parameter going into the function is considered to be a Series object labelled "row".
Next, use the apply function in pandas to apply the function - e.g.
df.apply (lambda row: label_race(row), axis=1)
Note the axis=1 specifier, that means that the application is done at a row, rather than a column level. The results are here:
0 White
1 Hispanic
2 White
3 White
4 Other
5 White
6 Two Or More
7 White
8 Haw/Pac Isl.
9 White
If you're happy with those results, then run it again, saving the results into a new column in your original dataframe.
df['race_label'] = df.apply (lambda row: label_race(row), axis=1)
The resultant dataframe looks like this (scroll to the right to see the new column):
lname fname rno_cd eri_afr_amer eri_asian eri_hawaiian eri_hispanic eri_nat_amer eri_white rno_defined race_label
0 MOST JEFF E 0 0 0 0 0 1 White White
1 CRUISE TOM E 0 0 0 1 0 0 White Hispanic
2 DEPP JOHNNY NaN 0 0 0 0 0 1 Unknown White
3 DICAP LEO NaN 0 0 0 0 0 1 Unknown White
4 BRANDO MARLON E 0 0 0 0 0 0 White Other
5 HANKS TOM NaN 0 0 0 0 0 1 Unknown White
6 DENIRO ROBERT E 0 1 0 0 0 1 White Two Or More
7 PACINO AL E 0 0 0 0 0 1 White White
8 WILLIAMS ROBIN E 0 0 1 0 0 0 White Haw/Pac Isl.
9 EASTWOOD CLINT E 0 0 0 0 0 1 White White
Since this is the first Google result for 'pandas new column from others', here's a simple example:
import pandas as pd
# make a simple dataframe
df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
df
# a b
# 0 1 3
# 1 2 4
# create an unattached column with an index
df.apply(lambda row: row.a + row.b, axis=1)
# 0 4
# 1 6
# do same but attach it to the dataframe
df['c'] = df.apply(lambda row: row.a + row.b, axis=1)
df
# a b c
# 0 1 3 4
# 1 2 4 6
If you get the SettingWithCopyWarning you can do it this way also:
fn = lambda row: row.a + row.b # define a function for the new column
col = df.apply(fn, axis=1) # get column data with an index
df = df.assign(c=col.values) # assign values to column 'c'
Source: https://stackoverflow.com/a/12555510/243392
And if your column name includes spaces you can use syntax like this:
df = df.assign(**{'some column name': col.values})
And here's the documentation for apply, and assign.
The answers above are perfectly valid, but a vectorized solution exists, in the form of numpy.select. This allows you to define conditions, then define outputs for those conditions, much more efficiently than using apply:
First, define conditions:
conditions = [
df['eri_hispanic'] == 1,
df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
df['eri_nat_amer'] == 1,
df['eri_asian'] == 1,
df['eri_afr_amer'] == 1,
df['eri_hawaiian'] == 1,
df['eri_white'] == 1,
]
Now, define the corresponding outputs:
outputs = [
'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
]
Finally, using numpy.select:
res = np.select(conditions, outputs, 'Other')
pd.Series(res)
0 White
1 Hispanic
2 White
3 White
4 Other
5 White
6 Two Or More
7 White
8 Haw/Pac Isl.
9 White
dtype: object
Why should numpy.select be used over apply? Here are some performance checks:
df = pd.concat([df]*1000)
In [42]: %timeit df.apply(lambda row: label_race(row), axis=1)
1.07 s ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [44]: %%timeit
...: conditions = [
...: df['eri_hispanic'] == 1,
...: df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
...: df['eri_nat_amer'] == 1,
...: df['eri_asian'] == 1,
...: df['eri_afr_amer'] == 1,
...: df['eri_hawaiian'] == 1,
...: df['eri_white'] == 1,
...: ]
...:
...: outputs = [
...: 'Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White'
...: ]
...:
...: np.select(conditions, outputs, 'Other')
...:
...:
3.09 ms ± 17 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Using numpy.select gives us vastly improved performance, and the discrepancy will only increase as the data grows.
.apply() takes in a function as the first parameter; pass in the label_race function as so:
df['race_label'] = df.apply(label_race, axis=1)
You don't need to make a lambda function to pass in a function.
try this,
df.loc[df['eri_white']==1,'race_label'] = 'White'
df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
df['race_label'].fillna('Other', inplace=True)
O/P:
lname fname rno_cd eri_afr_amer eri_asian eri_hawaiian \
0 MOST JEFF E 0 0 0
1 CRUISE TOM E 0 0 0
2 DEPP JOHNNY NaN 0 0 0
3 DICAP LEO NaN 0 0 0
4 BRANDO MARLON E 0 0 0
5 HANKS TOM NaN 0 0 0
6 DENIRO ROBERT E 0 1 0
7 PACINO AL E 0 0 0
8 WILLIAMS ROBIN E 0 0 1
9 EASTWOOD CLINT E 0 0 0
eri_hispanic eri_nat_amer eri_white rno_defined race_label
0 0 0 1 White White
1 1 0 0 White Hispanic
2 0 0 1 Unknown White
3 0 0 1 Unknown White
4 0 0 0 White Other
5 0 0 1 Unknown White
6 0 0 1 White Two Or More
7 0 0 1 White White
8 0 0 0 White Haw/Pac Isl.
9 0 0 1 White White
use .loc instead of apply.
it improves vectorization.
.loc works in simple manner, mask rows based on the condition, apply values to the freeze rows.
for more details visit, .loc docs
Performance metrics:
Accepted Answer:
def label_race (row):
if row['eri_hispanic'] == 1 :
return 'Hispanic'
if row['eri_afr_amer'] + row['eri_asian'] + row['eri_hawaiian'] + row['eri_nat_amer'] + row['eri_white'] > 1 :
return 'Two Or More'
if row['eri_nat_amer'] == 1 :
return 'A/I AK Native'
if row['eri_asian'] == 1:
return 'Asian'
if row['eri_afr_amer'] == 1:
return 'Black/AA'
if row['eri_hawaiian'] == 1:
return 'Haw/Pac Isl.'
if row['eri_white'] == 1:
return 'White'
return 'Other'
df=pd.read_csv('dataser.csv')
df = pd.concat([df]*1000)
%timeit df.apply(lambda row: label_race(row), axis=1)
1.15 s ± 46.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
My Proposed Answer:
def label_race(df):
df.loc[df['eri_white']==1,'race_label'] = 'White'
df.loc[df['eri_hawaiian']==1,'race_label'] = 'Haw/Pac Isl.'
df.loc[df['eri_afr_amer']==1,'race_label'] = 'Black/AA'
df.loc[df['eri_asian']==1,'race_label'] = 'Asian'
df.loc[df['eri_nat_amer']==1,'race_label'] = 'A/I AK Native'
df.loc[(df['eri_afr_amer'] + df['eri_asian'] + df['eri_hawaiian'] + df['eri_nat_amer'] + df['eri_white']) > 1,'race_label'] = 'Two Or More'
df.loc[df['eri_hispanic']==1,'race_label'] = 'Hispanic'
df['race_label'].fillna('Other', inplace=True)
df=pd.read_csv('s22.csv')
df = pd.concat([df]*1000)
%timeit label_race(df)
24.7 ms ± 1.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If we inspect its source code, apply() is a syntactic sugar for a Python for-loop (via the apply_series_generator() method of the FrameApply class). Because it has the pandas overhead, it's generally slower than a Python loop.
Use optimized (vectorized) methods wherever possible. If you have to use a loop, use #numba.jit decorator.
1. Don't use apply() for an if-else ladder
df.apply() is just about the slowest way to do this in pandas. As shown in the answers of user3483203 and Mohamed Thasin ah, depending on the dataframe size, np.select() and df.loc may be 50-300 times faster than df.apply() to produce the same output.
As it happens, a loop implementation (not unlike apply()) with the #jit decorator from numba module is (about 50-60%) faster than df.loc and np.select.1
Numba works on numpy arrays, so before using the jit decorator, you need to convert the dataframe into a numpy array. Then fill in values in a pre-initialized empty array by checking the conditions in a loop. Since numpy arrays don't have column names, you have to access the columns by their index in the loop. The most inconvenient part of the if-else ladder in the jitted function over the one in apply() is accessing the columns by their indices. Otherwise it's almost the same implementation.
import numpy as np
import numba as nb
#nb.jit(nopython=True)
def conditional_assignment(arr, res):
length = len(arr)
for i in range(length):
if arr[i][3] == 1:
res[i] = 'Hispanic'
elif arr[i][0] + arr[i][1] + arr[i][2] + arr[i][4] + arr[i][5] > 1:
res[i] = 'Two Or More'
elif arr[i][0] == 1:
res[i] = 'Black/AA'
elif arr[i][1] == 1:
res[i] = 'Asian'
elif arr[i][2] == 1:
res[i] = 'Haw/Pac Isl.'
elif arr[i][4] == 1:
res[i] = 'A/I AK Native'
elif arr[i][5] == 1:
res[i] = 'White'
else:
res[i] = 'Other'
return res
# the columns with the boolean data
cols = [c for c in df.columns if c.startswith('eri_')]
# initialize an empty array to be filled in a loop
# for string dtype arrays, we need to know the length of the longest string
# and use it to set the dtype
res = np.empty(len(df), dtype=f"<U{len('A/I AK Native')}")
# pass the underlying numpy array of `df[cols]` into the jitted function
df['rno_defined'] = conditional_assignment(df[cols].values, res)
2. Don't use apply() for numeric operations
If you need to add a new row by adding two columns, your first instinct may be to write
df['c'] = df.apply(lambda row: row['a'] + row['b'], axis=1)
But instead of this, row-wise add using sum(axis=1) method (or + operator if there are only a couple of columns):
df['c'] = df[['a','b']].sum(axis=1)
# equivalently
df['c'] = df['a'] + df['b']
Depending on the dataframe size, sum(1) may be 100s of times faster than apply().
In fact, you will almost never need apply() for numeric operations on a pandas dataframe because it has optimized methods for most operations: addition (sum(1)), subtraction (sub() or diff()), multiplication (prod(1)), division (div() or /), power (pow()), >, >=, ==, %, //, &, | etc. can all be performed on the entire dataframe without apply().
For example, let's say you want to create a new column using the following rule:
IF [colC] > 0 THEN RETURN [colA] * [colB]
ELSE RETURN [colA] / [colB]
Using the optimized pandas methods, this can be written as
df['new'] = df[['colA','colB']].prod(1).where(df['colC']>0, df['colA'] / df['colB'])
the equivalent apply() solution is:
df['new'] = df.apply(lambda row: row.colA * row.colB if row.colC > 0 else row.colA / row.colB, axis=1)
The approach using the optimized methods is 250 times faster than the equivalent apply() approach for dataframes with 20k rows. This gap only increases as the data size increases (for a dataframe with 1 mil rows, it's 365 times faster) and the time difference will become more and more noticeable.2
1: In the below result, I show the performance of the three approaches using a dataframe with 24 mil rows (this is the largest frame I can construct on my machine). For smaller frames, the numba-jitted function consistently runs at least 50% faster than the other two as well (you can check yourself).
def pd_loc(df):
df['rno_defined'] = 'Other'
df.loc[df['eri_nat_amer'] == 1, 'rno_defined'] = 'A/I AK Native'
df.loc[df['eri_asian'] == 1, 'rno_defined'] = 'Asian'
df.loc[df['eri_afr_amer'] == 1, 'rno_defined'] = 'Black/AA'
df.loc[df['eri_hawaiian'] == 1, 'rno_defined'] = 'Haw/Pac Isl.'
df.loc[df['eri_white'] == 1, 'rno_defined'] = 'White'
df.loc[df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1) > 1, 'rno_defined'] = 'Two Or More'
df.loc[df['eri_hispanic'] == 1, 'rno_defined'] = 'Hispanic'
return df
def np_select(df):
conditions = [df['eri_hispanic'] == 1,
df[['eri_afr_amer', 'eri_asian', 'eri_hawaiian', 'eri_nat_amer', 'eri_white']].sum(1).gt(1),
df['eri_nat_amer'] == 1,
df['eri_asian'] == 1,
df['eri_afr_amer'] == 1,
df['eri_hawaiian'] == 1,
df['eri_white'] == 1]
outputs = ['Hispanic', 'Two Or More', 'A/I AK Native', 'Asian', 'Black/AA', 'Haw/Pac Isl.', 'White']
df['rno_defined'] = np.select(conditions, outputs, 'Other')
return df
#nb.jit(nopython=True)
def conditional_assignment(arr, res):
length = len(arr)
for i in range(length):
if arr[i][3] == 1 :
res[i] = 'Hispanic'
elif arr[i][0] + arr[i][1] + arr[i][2] + arr[i][4] + arr[i][5] > 1 :
res[i] = 'Two Or More'
elif arr[i][0] == 1:
res[i] = 'Black/AA'
elif arr[i][1] == 1:
res[i] = 'Asian'
elif arr[i][2] == 1:
res[i] = 'Haw/Pac Isl.'
elif arr[i][4] == 1 :
res[i] = 'A/I AK Native'
elif arr[i][5] == 1:
res[i] = 'White'
else:
res[i] = 'Other'
return res
def nb_loop(df):
cols = [c for c in df.columns if c.startswith('eri_')]
res = np.empty(len(df), dtype=f"<U{len('A/I AK Native')}")
df['rno_defined'] = conditional_assignment(df[cols].values, res)
return df
# df with 24mil rows
n = 4_000_000
df = pd.DataFrame({
'eri_afr_amer': [0, 0, 0, 0, 0, 0]*n,
'eri_asian': [1, 0, 0, 0, 0, 0]*n,
'eri_hawaiian': [0, 0, 0, 1, 0, 0]*n,
'eri_hispanic': [0, 1, 0, 0, 1, 0]*n,
'eri_nat_amer': [0, 0, 0, 0, 1, 0]*n,
'eri_white': [0, 0, 1, 1, 0, 0]*n
}, dtype='int8')
df.insert(0, 'name', ['MOST', 'CRUISE', 'DEPP', 'DICAP', 'BRANDO', 'HANKS']*n)
%timeit nb_loop(df)
# 5.23 s ± 45.2 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
%timeit pd_loc(df)
# 7.97 s ± 28.8 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
%timeit np_select(df)
# 8.5 s ± 39.6 ms per loop (mean ± std. dev. of 10 runs, 10 loops each)
2: In the below result, I show the performance of the two approaches using a dataframe with 20k rows and again with 1 mil rows. For smaller frames, the gap is smaller because the optimized approach has an overhead while apply() is a loop. As the size of the frame increases, the vectorization overhead cost diminishes w.r.t. to the overall runtime of the code while apply() remains a loop over the frame.
n = 20_000 # 1_000_000
df = pd.DataFrame(np.random.rand(n,3)-0.5, columns=['colA','colB','colC'])
%timeit df[['colA','colB']].prod(1).where(df['colC']>0, df['colA'] / df['colB'])
# n = 20000: 2.69 ms ± 23.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# n = 1000000: 86.2 ms ± 441 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.apply(lambda row: row.colA * row.colB if row.colC > 0 else row.colA / row.colB, axis=1)
# n = 20000: 679 ms ± 33.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# n = 1000000: 31.5 s ± 587 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Yet another (easily generalizable) approach, whose corner-stone is pandas.DataFrame.idxmax. First, the easily generalizable preamble.
# Indeed, all your conditions boils down to the following
_gt_1_key = 'two_or_more'
_lt_1_key = 'other'
# The "dictionary-based" if-else statements
labels = {
_gt_1_key : 'Two Or More',
'eri_hispanic': 'Hispanic',
'eri_nat_amer': 'A/I AK Native',
'eri_asian' : 'Asian',
'eri_afr_amer': 'Black/AA',
'eri_hawaiian': 'Haw/Pac Isl.',
'eri_white' : 'White',
_lt_1_key : 'Other',
}
# The output-driving 1-0 matrix
mat = df.filter(regex='^eri_').copy() # `~.copy` to avoid `SettingWithCopyWarning`
... and, finally, in a vectorized fashion:
mat[_gt_1_key] = gt1 = mat.sum(axis=1)
mat[_lt_1_key] = gt1.eq(0).astype(int)
race_label = mat.idxmax(axis=1).map(labels)
where
>>> race_label
0 White
1 Hispanic
2 White
3 White
4 Other
5 White
6 Two Or More
7 White
8 Haw/Pac Isl.
9 White
dtype: object
that is a pandas.Series instance you can easily host within df, i.e. doing df['race_label'] = race_label.
Choosing a method according to the complexity of the criteria
For the examples below - in order to show multiple types of rules for the new column - we will assume a DataFrame with columns 'red', 'green' and 'blue', containing floating-point values ranging 0 to 1.
General case: .apply
As long as the necessary logic to compute the new value can be written as a function of other values in the same row, we can use the .apply method of the DataFrame to get the desired result. Write the function so that it accepts a single parameter, which is a single row of the input:
def as_hex(value):
# clamp to avoid rounding errors etc.
return min(max(0, int(value * 256)), 255)
def hex_color(row):
r, g, b = as_hex(row['red']), as_hex(row['green']), as_hex(row['blue'])
return f'#{r:02x}{g:02x}{b:02x}'
Pass the function itself (don't write parentheses after the name) to .apply, and specify axis=1 (meaning to supply rows to the categorizing function, so as to compute a column - rather than the other way around). Thus:
df['hex_color'] = df.apply(hex_color, axis=1)
Note that wrapping in lambda is not necessary, since we are not binding any arguments or otherwise modifying the function.
The .apply step is necessary because the conversion function itself is not vectorized. Thus, a naive approach like df['color'] = hex_color(df) will not work (example question).
This tool is powerful, but inefficient. For best performance, please use a more specific approach where applicable.
Multiple choices with conditions: numpy.select or repeated assignment with df.loc or df.where
Suppose we were thresholding the color values, and computing rough color names like so:
def additive_color(row):
# Insert here: logic that takes values from the `row` and computes
# the desired cell value for the new column in that row.
# The `row` is an ordinary `Series` object representing a row of the
# original `DataFrame`; it can be indexed with column names, thus:
if row['red'] > 0.5:
if row['green'] > 0.5:
return 'white' if row['blue'] > 0.5 else 'yellow'
else:
return 'magenta' if row['blue'] > 0.5 else 'red'
elif row['green'] > 0.5:
return 'cyan' if row['blue'] > 0.5 else 'green'
else:
return 'blue' if row['blue'] > 0.5 else 'black'
In cases like this - where the categorizing function would be an if/else ladder, or match/case in 3.10 and up - we may get much faster performance using numpy.select.
This approach works very differently. First, compute masks on the data for where each condition applies:
black = (df['red'] <= 0.5) & (df['green'] <= 0.5) & (df['blue'] <= 0.5)
white = (df['red'] > 0.5) & (df['green'] > 0.5) & (df['blue'] > 0.5)
To call numpy.select, we need two parallel sequences - one of the conditions, and another of the corresponding values:
df['color'] = np.select(
[white, black],
['white', 'black'],
'colorful'
)
The optional third argument specifies a value to use when none of the conditions are met. (As an exercise: fill in the remaining conditions, and try it without a third argument.)
A similar approach is to make repeated assignments based on each condition. Assign the default value first, and then use df.loc to assign specific values for each condition:
df['color'] = 'colorful'
df.loc[white, 'color'] = 'white'
df.loc[black, 'color'] = 'black'
Alternately, df.where can be used to do the assignments. However, df.where, used like this, assigns the specified value in places where the condition is not met, so the conditions must be inverted:
df['color'] = 'colorful'
df['color'] = df['color'].where(~white, 'white').where(~black, 'black')
Simple mathematical manipulations: built-in mathematical operators and broadcasting
For example, an apply-based approach like:
def brightness(row):
return row['red'] * .299 + row['green'] * .587 + row['blue'] * .114
df['brightness'] = df.apply(brightness, axis=1)
can instead be written by broadcasting the operators, for much better performance (and is also simpler):
df['brightness'] = df['red'] * .299 + df['green'] * .587 + df['blue'] * .114
As an exercise, here's the first example redone that way:
def as_hex(column):
scaled = (column * 256).astype(int)
clamped = scaled.where(scaled >= 0, 0).where(scaled <= 255, 255)
return clamped.apply(lambda i: f'{i:02x}')
df['hex_color'] = '#' + as_hex(df['red']) + as_hex(df['green']) + as_hex(df['blue'])
I was unable to find a vectorized equivalent to format the integer values as hex strings, so .apply is still used internally here - meaning that the full speed penalty still comes into play. Still, this demonstrates some general techniques.
For more details and examples, see cottontail's answer.
i'm doing a matrix calculation using pandas in python.
my raw data is in the form of list of strings(which is unique for each row).
id list_of_value
0 ['a','b','c']
1 ['d','b','c']
2 ['a','b','c']
3 ['a','b','c']
i have to do a calculate a score with one row and against all the other rows
score calculation algorithm:
Step 1: Take value of id 0: ['a','b','c'],
Step 2: find the intersection between id 0 and id 1 ,
resultant = ['b','c']
Step 3: Score Calculation => resultant.size / id(0).size
repeat step 2,3 between id 0 and id 1,2,3, similarly for all the ids.
Create N * N matrix:
- 0 1 2 3
0 1 0.6 1 1
1 0.6 1 1 1
2 1 1 1 1
3 1 1 1 1
At present i'm using the pandas dummies approach to calculate the score:
s = pd.get_dummies(df.list_of_value.explode()).sum(level=0)
s.dot(s.T).div(s.sum(1))
but there is an repetition in calculation after the diagonal of the matrix, the score calculation till diagonal is sufficient. for eg:
calculation of score of ID 0, will be only till ID(row,column) (0,0), score for ID(row,column) (0,1),(0,2),(0,3) can be copied from ID(row,column) (1,0),(2,0),(3,0).
Detail on the calculation:
i need to calculate till the diagonal, that is till the yellow colored box(the diagonal of matrix), the white values are already calculated in the green shaded area (for ref), i just have to transpose the green shaded area to white.
how can i do this in pandas?
First of all here is a profiling of your code. First all commands separately, and then as you posted it.
%timeit df.list_of_value.explode()
%timeit pd.get_dummies(s)
%timeit s.sum(level=0)
%timeit s.dot(s.T)
%timeit s.sum(1)
%timeit s2.div(s3)
The above profiling returned the following results:
Explode : 1000 loops, best of 3: 201 µs per loop
Dummies : 1000 loops, best of 3: 697 µs per loop
Sum : 1000 loops, best of 3: 1.36 ms per loop
Dot : 1000 loops, best of 3: 453 µs per loop
Sum2 : 10000 loops, best of 3: 162 µs per loop
Divide : 100 loops, best of 3: 1.81 ms per loop
Running Your two lines together results in:
100 loops, best of 3: 5.35 ms per loop
Using a different approach relying less on the (sometimes expensive) functionality of pandas, the code I created takes just about a third of the time by skipping the calculation for the upper triangular matrix and the diagonal as well.
import numpy as np
# create a matrix filled with ones (thus the diagonal is already filled with ones)
df2 = np.ones(shape = (len(df), len(df)))
for i in range(len(df)):
d0 = set(df.iloc[i].list_of_value)
d0_len = len(d0)
# the inner loop starts at i+1 because we don't need to calculate the diagonal
for j in range(i + 1, len(df)):
df2[j, i] = len(d0.intersection(df.iloc[j].list_of_value)) / d0_len
# copy the lower triangular matrix to the upper triangular matrix
df2[np.mask_indices(len(df2), np.triu)] = df2.T[np.mask_indices(len(df2), np.triu)]
# create a DataFrame from the numpy array with the column names set to score<id>
df2 = pd.DataFrame(df2, columns = [f"score{i}" for i in range(len(df))])
With df given as
df = pd.DataFrame(
[[['a','b','c']],
[['d','b','c']],
[['a','b','c']],
[['a','b','c']]],
columns = ["list_of_value"])
the profiling for this code results in a running time of only 1.68ms.
1000 loops, best of 3: 1.68 ms per loop
UPDATE
Instead of operating on the entire DataFrame, just picking the Series that is needed gives a huge speedup.
Three methods to iterate over the entries in the Series have been tested, and all of them are more or less equal regarding the performance.
%%timeit df = pd.DataFrame([[['a','b','c']], [['d','b','c']], [['a','b','c']], [['a','b','c']]], columns = ["list_of_value"])
# %%timeit df = pd.DataFrame([[random.choices(list("abcdefghijklmnopqrstuvwxyz"), k = 15)] for _ in range(100)], columns = ["list_of_value"])
# create a matrix filled with ones (thus the diagonal is already filled with ones)
df2 = np.ones(shape = (len(df), len(df)))
# get the Series from the DataFrame
dfl = df.list_of_value
for i, d0 in enumerate(dfl.values):
# for i, d0 in dfl.iteritems(): # in terms of performance about equal to the line above
# for i in range(len(dfl)): # slightly less performant than enumerate(dfl.values)
d0 = set(d0)
d0_len = len(d0)
# the inner loop starts at i+1 because we don't need to calculate the diagonal
for j in range(i + 1, len(dfl)):
df2[j, i] = len(d0.intersection(dfl.iloc[j])) / d0_len
# copy the lower triangular matrix to the upper triangular matrix
df2[np.mask_indices(len(df2), np.triu)] = df2.T[np.mask_indices(len(df2), np.triu)]
# create a DataFrame from the numpy array with the column names set to score<id>
df2 = pd.DataFrame(df2, columns = [f"score{i}" for i in range(len(dfl))])
There are a lot of pitfalls with pandas. E.g. always access the rows of a DataFrame or Series via df.iloc[0] instead of df[0]. Both works but df.iloc[0] is much faster.
The timings for the first matrix with 4 elements each with a list of size 3 resulted in a speedup of about 3 times as fast.
1000 loops, best of 3: 443 µs per loop
And when using a bigger dataset I got far better results with a speedup of over 11:
# operating on the DataFrame
10 loop, best of 3: 565 ms per loop
# operating on the Series
10 loops, best of 3: 47.7 ms per loop
UPDATE 2
When not using pandas at all (during the calculation), you get another significant speedup. Therefore you simply need to convert the column to operate on into a list.
%%timeit df = pd.DataFrame([[['a','b','c']], [['d','b','c']], [['a','b','c']], [['a','b','c']]], columns = ["list_of_value"])
# %%timeit df = pd.DataFrame([[random.choices(list("abcdefghijklmnopqrstuvwxyz"), k = 15)] for _ in range(100)], columns = ["list_of_value"])
# convert the column of the DataFrame to a list
dfl = list(df.list_of_value)
# create a matrix filled with ones (thus the diagonal is already filled with ones)
df2 = np.ones(shape = (len(dfl), len(dfl)))
for i, d0 in enumerate(dfl):
d0 = set(d0)
d0_len = len(d0)
# the inner loop starts at i+1 because we don't need to calculate the diagonal
for j in range(i + 1, len(dfl)):
df2[j, i] = len(d0.intersection(dfl[j])) / d0_len
# copy the lower triangular matrix to the upper triangular matrix
df2[np.mask_indices(len(df2), np.triu)] = df2.T[np.mask_indices(len(df2), np.triu)]
# create a DataFrame from the numpy array with the column names set to score<id>
df2 = pd.DataFrame(df2, columns = [f"score{i}" for i in range(len(dfl))])
On the data provided in the question we only see a slightly better result compared to the first update.
1000 loops, best of 3: 363 µs per loop
But when using bigger data (100 rows with lists of size 15) the advantage gets obvious:
100 loops, best of 3: 5.26 ms per loop
Here a comparison of all the suggested methods:
+----------+-----------------------------------------+
| | Using the Dataset from the question |
+----------+-----------------------------------------+
| Question | 100 loops, best of 3: 4.63 ms per loop |
+----------+-----------------------------------------+
| Answer | 1000 loops, best of 3: 1.59 ms per loop |
+----------+-----------------------------------------+
| Update 1 | 1000 loops, best of 3: 447 µs per loop |
+----------+-----------------------------------------+
| Update 2 | 1000 loops, best of 3: 362 µs per loop |
+----------+-----------------------------------------+
Although this question is well answered I will show a more readable and also very efficient alternative:
from itertools import product
len_df = df.shape[0]
values = tuple(map(lambda comb: np.isin(*comb).sum() / len(comb[0]),
product(df['list_of_value'], repeat=2)))
pd.DataFrame(index=df['id'],
columns=df['id'],
data=np.array(values).reshape(len_df, len_df))
id 0 1 2 3
id
0 1.000000 0.666667 1.000000 1.000000
1 0.666667 1.000000 0.666667 0.666667
2 1.000000 0.666667 1.000000 1.000000
3 1.000000 0.666667 1.000000 1.000000
%%timeit
len_df = df.shape[0]
values = tuple(map(lambda comb: np.isin(*comb).sum() / len(comb[0]),
product(df['list_of_value'], repeat=2)))
pd.DataFrame(index=df['id'],
columns=df['id'],
data=np.array(values).reshape(len_df, len_df))
850 µs ± 18.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
#convert the column of the DataFrame to a list
dfl = list(df.list_of_value)
# create a matrix filled with ones (thus the diagonal is already filled with ones)
df2 = np.ones(shape = (len(dfl), len(dfl)))
for i, d0 in enumerate(dfl):
d0 = set(d0)
d0_len = len(d0)
# the inner loop starts at i+1 because we don't need to calculate the diagonal
for j in range(i + 1, len(dfl)):
df2[j, i] = len(d0.intersection(dfl[j])) / d0_len
# copy the lower triangular matrix to the upper triangular matrix
df2[np.mask_indices(len(df2), np.triu)] = df2.T[np.mask_indices(len(df2), np.triu)]
# create a DataFrame from the numpy array with the column names set to score<id>
df2 = pd.DataFrame(df2, columns = [f"score{i}" for i in range(len(dfl))])
470 µs ± 79.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
I am not inclined to change your first line, although I'm sure it could be faster, because it's not going to be the bottleneck as your data gets larger. But the second line could be, and is also extremely easy to improve:
Change this:
s.dot(s.T).div(s.sum(1))
To:
arr=s.values
np.dot( arr, arr.T ) / arr[0].sum()
That's just doing it in numpy instead of pandas, but often you'll get a huge speedup. On your small, sample data it will only speed up by 2x, but if you increase your dataframe from 4 rows to 400 rows, then I see a speedup of over 20x.
As an aside, I would be inclined to not worry about the triangular aspect of the problem, at least as far as speed. You have to make the code considerably more complex and you probably aren't even gaining any speed in a situation like this.
Conversely, if conserving storage space is important, then obviously retaining only the upper (or lower) triangle will cut your storage needs by slightly more than half.
(If you really do care about the triangular aspect for dimensionality numpy does have related functions/methods but I don't know them offhand and, again, it's not clear to me if it's worth the extra complexity in this case.)
For example, if there are 5 numbers 1, 2, 3, 4, 5
I want a random result like
[[ 2, 3, 1, 4, 5]
[ 5, 1, 2, 3, 4]
[ 3, 2, 4, 5, 1]
[ 1, 4, 5, 2, 3]
[ 4, 5, 3, 1, 2]]
Ensure every number is unique in its row and column.
Is there an efficient way to do this?
I've tried to use while loops to generate one row for each iteration, but it seems not so efficient.
import numpy as np
numbers = list(range(1,6))
result = np.zeros((5,5), dtype='int32')
row_index = 0
while row_index < 5:
np.random.shuffle(numbers)
for column_index, number in enumerate(numbers):
if number in result[:, column_index]:
break
else:
result[row_index, :] = numbers
row_index += 1
Just for your information, what you are looking for is a way of generating latin squares.
As for the solution, it depends on how much random "random" is for you.
I would devise at least four main techniques, two of which have been already proposed.
Hence, I will briefly describe the other two:
loop through all possible permutations of the items and accept the first that satisfy the unicity constraint along rows
use only cyclic permutations to build subsequent rows: these are by construction satisfying the unicity constraint along rows (the cyclic transformation can be done forward or backward); for improved "randomness" the rows can be shuffled
Assuming we work with standard Python data types since I do not see a real merit in using NumPy (but results can be easily converted to np.ndarray if necessary), this would be in code (the first function is just to check that the solution is actually correct):
import random
import math
import itertools
# this only works for Iterable[Iterable]
def is_latin_rectangle(rows):
valid = True
for row in rows:
if len(set(row)) < len(row):
valid = False
if valid and rows:
for i, val in enumerate(rows[0]):
col = [row[i] for row in rows]
if len(set(col)) < len(col):
valid = False
break
return valid
def is_latin_square(rows):
return is_latin_rectangle(rows) and len(rows) == len(rows[0])
# : prepare the input
n = 9
items = list(range(1, n + 1))
# shuffle items
random.shuffle(items)
# number of permutations
print(math.factorial(n))
def latin_square1(items, shuffle=True):
result = []
for elems in itertools.permutations(items):
valid = True
for i, elem in enumerate(elems):
orthogonals = [x[i] for x in result] + [elem]
if len(set(orthogonals)) < len(orthogonals):
valid = False
break
if valid:
result.append(elems)
if shuffle:
random.shuffle(result)
return result
rows1 = latin_square1(items)
for row in rows1:
print(row)
print(is_latin_square(rows1))
def latin_square2(items, shuffle=True, forward=False):
sign = -1 if forward else 1
result = [items[sign * i:] + items[:sign * i] for i in range(len(items))]
if shuffle:
random.shuffle(result)
return result
rows2 = latin_square2(items)
for row in rows2:
print(row)
print(is_latin_square(rows2))
rows2b = latin_square2(items, False)
for row in rows2b:
print(row)
print(is_latin_square(rows2b))
For comparison, an implementation by trying random permutations and accepting valid ones (fundamentally what #hpaulj proposed) is also presented.
def latin_square3(items):
result = [list(items)]
while len(result) < len(items):
new_row = list(items)
random.shuffle(new_row)
result.append(new_row)
if not is_latin_rectangle(result):
result = result[:-1]
return result
rows3 = latin_square3(items)
for row in rows3:
print(row)
print(is_latin_square(rows3))
I did not have time (yet) to implement the other method (with backtrack Sudoku-like solutions from #ConfusedByCode).
With timings for n = 5:
%timeit latin_square1(items)
321 µs ± 24.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit latin_square2(items)
7.5 µs ± 222 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit latin_square2(items, False)
2.21 µs ± 69.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit latin_square3(items)
2.15 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
... and for n = 9:
%timeit latin_square1(items)
895 ms ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit latin_square2(items)
12.5 µs ± 200 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit latin_square2(items, False)
3.55 µs ± 55.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit latin_square3(items)
The slowest run took 36.54 times longer than the fastest. This could mean that an intermediate result is being cached.
9.76 s ± 9.23 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
So, solution 1 is giving a fair deal of randomness but it is not terribly fast (and scale with O(n!)), solution 2 (and 2b) are much faster (scaling with O(n)) but not as random as solution 1. Solution 3 is very slow and the performance can vary significantly (can probably be sped up by letting the last iteration be computed instead of guessed).
Getting more technical, other efficient algorithms are discussed in:
Jacobson, M. T. and Matthews, P. (1996), Generating uniformly distributed random latin squares. J. Combin. Designs, 4: 405-437. doi:10.1002/(SICI)1520-6610(1996)4:6<405::AID-JCD3>3.0.CO;2-J
This may seem odd, but you have basically described generating a random n-dimension Sudoku puzzle. From a blog post by Daniel Beer:
The basic approach to solving a Sudoku puzzle is by a backtracking search of candidate values for each cell. The general procedure is as follows:
Generate, for each cell, a list of candidate values by starting with the set of all possible values and eliminating those which appear in the same row, column and box as the cell being examined.
Choose one empty cell. If none are available, the puzzle is solved.
If the cell has no candidate values, the puzzle is unsolvable.
For each candidate value in that cell, place the value in the cell and try to recursively solve the puzzle.
There are two optimizations which greatly improve the performance of this algorithm:
When choosing a cell, always pick the one with the fewest candidate values. This reduces the branching factor. As values are added to the grid, the number of candidates for other cells reduces too.
When analysing the candidate values for empty cells, it's much quicker to start with the analysis of the previous step and modify it by removing values along the row, column and box of the last-modified cell. This is O(N) in the size of the puzzle, whereas analysing from scratch is O(N3).
In your case an "unsolvable puzzle" is an invalid matrix. Every element in the matrix will be unique on both axis in a solvable puzzle.
I experimented with a brute-force random choice. Generate a row, and if valid, add to the accumulated lines:
def foo(n=5,maxi=200):
arr = np.random.choice(numbers,n, replace=False)[None,:]
for i in range(maxi):
row = np.random.choice(numbers,n, replace=False)[None,:]
if (arr==row).any(): continue
arr = np.concatenate((arr, row),axis=0)
if arr.shape[0]==n: break
print(i)
return arr
Some sample runs:
In [66]: print(foo())
199
[[1 5 4 2 3]
[4 1 5 3 2]
[5 3 2 1 4]
[2 4 3 5 1]]
In [67]: print(foo())
100
[[4 2 3 1 5]
[1 4 5 3 2]
[5 1 2 4 3]
[3 5 1 2 4]
[2 3 4 5 1]]
In [68]: print(foo())
57
[[1 4 5 3 2]
[2 1 3 4 5]
[3 5 4 2 1]
[5 3 2 1 4]
[4 2 1 5 3]]
In [69]: print(foo())
174
[[2 1 5 4 3]
[3 4 1 2 5]
[1 3 2 5 4]
[4 5 3 1 2]
[5 2 4 3 1]]
In [76]: print(foo())
41
[[3 4 5 1 2]
[1 5 2 3 4]
[5 2 3 4 1]
[2 1 4 5 3]
[4 3 1 2 5]]
The required number of tries varies all over the place, with some exceeding my iteration limit.
Without getting into any theory, there's going to be difference between quickly generating a 2d permutation, and generating one that is in some sense or other, maximally random. I suspect my approach is closer to this random goal than a more systematic and efficient approach (but I can't prove it).
def opFoo():
numbers = list(range(1,6))
result = np.zeros((5,5), dtype='int32')
row_index = 0; i = 0
while row_index < 5:
np.random.shuffle(numbers)
for column_index, number in enumerate(numbers):
if number in result[:, column_index]:
break
else:
result[row_index, :] = numbers
row_index += 1
i += 1
return i, result
In [125]: opFoo()
Out[125]:
(11, array([[2, 3, 1, 5, 4],
[4, 5, 1, 2, 3],
[3, 1, 2, 4, 5],
[1, 3, 5, 4, 2],
[5, 3, 4, 2, 1]]))
Mine is quite a bit slower than the OP's, but mine is correct.
This is an improvement on mine (2x faster):
def foo1(n=5,maxi=300):
numbers = np.arange(1,n+1)
np.random.shuffle(numbers)
arr = numbers.copy()[None,:]
for i in range(maxi):
np.random.shuffle(numbers)
if (arr==numbers).any(): continue
arr = np.concatenate((arr, numbers[None,:]),axis=0)
if arr.shape[0]==n: break
return arr, i
Why is translated Sudoku solver slower than original?
I found that with this translation of Java Sudoku solver, that using Python lists was faster than numpy arrays.
I may try to adapt that script to this problem - tomorrow.
EDIT: Below is an implementation of the second solution in norok2's answer.
EDIT: we can shuffle the generated square again to make it real random.
So the solve functions can be modified to:
def solve(numbers):
shuffle(numbers)
shift = randint(1, len(numbers)-1)
res = []
for r in xrange(len(numbers)):
res.append(list(numbers))
numbers = list(numbers[shift:] + numbers[0:shift])
rows = range(len(numbers))
shuffle(rows)
shuffled_res = []
for i in xrange(len(rows)):
shuffled_res.append(res[rows[i]])
return shuffled_res
EDIT: I previously misunderstand the question.
So, here's a 'quick' method which generates a 'to-some-extent' random solutions.
The basic idea is,
a, b, c
b, c, a
c, a, b
We can just move a row of data by a fixed step to form the next row. Which will qualify our restriction.
So, here's the code:
from random import shuffle, randint
def solve(numbers):
shuffle(numbers)
shift = randint(1, len(numbers)-1)
res = []
for r in xrange(len(numbers)):
res.append(list(numbers))
numbers = list(numbers[shift:] + numbers[0:shift])
return res
def check(arr):
for c in xrange(len(arr)):
col = [arr[r][c] for r in xrange(len(arr))]
if len(set(col)) != len(col):
return False
return True
if __name__ == '__main__':
from pprint import pprint
res = solve(range(5))
pprint(res)
print check(res)
This is a possible solution by itertools, if you don't insist on using numpy which I'm not familiar with:
import itertools
from random import randint
list(itertools.permutations(range(1, 6)))[randint(0, len(range(1, 6))]
# itertools returns a iterator of all possible permutations of the given list.
Can't type code from the phone, here's the pseudocode:
Create a matrix with one diamention more than tge target matrix(3 d)
Initialize the 25 elements with numbers from 1 to 5
Iterate over the 25 elements.
Choose a random value for the first element from the element list(which contains numbers 1 through 5)
Remove the randomly chosen value from all the elements in its row and column.
Repeat steps 4 and 5 for all the elements.
I'm sorry for the poor phrasing of the question, but it was the best I could do.
I know exactly what I want, but not exactly how to ask for it.
Here is the logic demonstrated by an example:
Two conditions that take on the values 1 or 0 trigger a signal that also takes on the values 1 or 0. Condition A triggers the signal (If A = 1 then signal = 1, else signal = 0) no matter what. Condition B does NOT trigger the signal, but the signal stays triggered if condition B stays equal to 1
after the signal previously has been triggered by condition A.
The signal goes back to 0 only after both A and B have gone back to 0.
1. Input:
2. Desired output (signal_d) and confirmation that a for loop can solve it (signal_l):
3. My attempt using numpy.where():
4. Reproducible snippet:
# Settings
import numpy as np
import pandas as pd
import datetime
# Data frame with input and desired output i column signal_d
df = pd.DataFrame({'condition_A':list('00001100000110'),
'condition_B':list('01110011111000'),
'signal_d':list('00001111111110')})
colnames = list(df)
df[colnames] = df[colnames].apply(pd.to_numeric)
datelist = pd.date_range(pd.datetime.today().strftime('%Y-%m-%d'), periods=14).tolist()
df['dates'] = datelist
df = df.set_index(['dates'])
# Solution using a for loop with nested ifs in column signal_l
df['signal_l'] = df['condition_A'].copy(deep = True)
i=0
for observations in df['signal_l']:
if df.ix[i,'condition_A'] == 1:
df.ix[i,'signal_l'] = 1
else:
# Signal previously triggered by condition_A
# AND kept "alive" by condition_B:
if df.ix[i - 1,'signal_l'] & df.ix[i,'condition_B'] == 1:
df.ix[i,'signal_l'] = 1
else:
df.ix[i,'signal_l'] = 0
i = i + 1
# My attempt with np.where in column signal_v1
df['Signal_v1'] = df['condition_A'].copy()
df['Signal_v1'] = np.where(df.condition_A == 1, 1, np.where( (df.shift(1).Signal_v1 == 1) & (df.condition_B == 1), 1, 0))
print(df)
This is pretty straight forward using a for loop with lagged values and nested if sentences, but I can't figure it out using vectorized functions like numpy.where(). And I know this would be much faster for bigger data frames.
Thank you for any suggestions!
I don't think there is a way to vectorize this operation that will be significantly faster than a Python loop. (At least, not if you want to stick with just Python, pandas and numpy.)
However, you can improve the performance of this operation by simplifying your code. Your implementation uses if statements and a lot of DataFrame indexing. These are relatively costly operations.
Here's a modification of your script that includes two functions: add_signal_l(df) and add_lagged(df). The first is your code, just wrapped up in a function. The second uses a simpler function to achieve the same result--still a Python loop, but it uses numpy arrays and bitwise operators.
import numpy as np
import pandas as pd
import datetime
#-----------------------------------------------------------------------
# Create the test DataFrame
# Data frame with input and desired output i column signal_d
df = pd.DataFrame({'condition_A':list('00001100000110'),
'condition_B':list('01110011111000'),
'signal_d':list('00001111111110')})
colnames = list(df)
df[colnames] = df[colnames].apply(pd.to_numeric)
datelist = pd.date_range(pd.datetime.today().strftime('%Y-%m-%d'), periods=14).tolist()
df['dates'] = datelist
df = df.set_index(['dates'])
#-----------------------------------------------------------------------
def add_signal_l(df):
# Solution using a for loop with nested ifs in column signal_l
df['signal_l'] = df['condition_A'].copy(deep = True)
i=0
for observations in df['signal_l']:
if df.ix[i,'condition_A'] == 1:
df.ix[i,'signal_l'] = 1
else:
# Signal previously triggered by condition_A
# AND kept "alive" by condition_B:
if df.ix[i - 1,'signal_l'] & df.ix[i,'condition_B'] == 1:
df.ix[i,'signal_l'] = 1
else:
df.ix[i,'signal_l'] = 0
i = i + 1
def compute_lagged_signal(a, b):
x = np.empty_like(a)
x[0] = a[0]
for i in range(1, len(a)):
x[i] = a[i] | (x[i-1] & b[i])
return x
def add_lagged(df):
df['lagged'] = compute_lagged_signal(df['condition_A'].values, df['condition_B'].values)
Here's a comparison of the timing of the two function, run in an IPython session:
In [85]: df
Out[85]:
condition_A condition_B signal_d
dates
2017-06-09 0 0 0
2017-06-10 0 1 0
2017-06-11 0 1 0
2017-06-12 0 1 0
2017-06-13 1 0 1
2017-06-14 1 0 1
2017-06-15 0 1 1
2017-06-16 0 1 1
2017-06-17 0 1 1
2017-06-18 0 1 1
2017-06-19 0 1 1
2017-06-20 1 0 1
2017-06-21 1 0 1
2017-06-22 0 0 0
In [86]: %timeit add_signal_l(df)
8.45 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [87]: %timeit add_lagged(df)
137 µs ± 581 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
As you can see, add_lagged(df) is much faster.
I take 3 values of a column (third) and put these values into a row on 3 new columns. And merge the new and old columns into a new matrix A
Input timeseries in col nr3 values in col nr 1 and 2
[x x 1]
[x x 2]
[x x 3]
output : matrix A
[x x 1 0 0 0]
[x x 2 0 0 0]
[x x 3 1 2 3]
[x x 4 2 3 4]
So for brevity, first the code generates the matrix 6 rows / 3 col. The last column I want to use to fill 3 extra columns and merge it into a new matrix A. This matrix A was prefilled with 2 rows to offset the starting position.
I have implemented this idea in the code below and it takes a really long time to process large data sets.
How to improve the speed of this conversion
import numpy as np
matrix = np.arange(18).reshape((6, 3))
nr=3
A = np.zeros((nr-1,nr))
for x in range( matrix.shape[0]-nr+1):
newrow = (np.transpose( matrix[x:x+nr,2:3] ))
A = np.vstack([A , newrow])
total= np.column_stack((matrix,A))
print (total)
Here's an approach using broadcasting to get those sliding windowed elements and then just some stacking to get A -
col2 = matrix[:,2]
nrows = col2.size-nr+1
out = np.zeros((nr-1+nrows,nr))
col2_2D = np.take(col2,np.arange(nrows)[:,None] + np.arange(nr))
out[nr-1:] = col2_2D
Here's an efficient alternative using NumPy strides to get col2_2D -
n = col2.strides[0]
col2_2D = np.lib.stride_tricks.as_strided(col2, shape=(nrows,nr), strides=(n,n))
It would be even better to initialize an output array of zeros of the size as total and then assign values into it with col2_2D and finally with input array matrix.
Runtime test
Approaches as functions -
def org_app1(matrix,nr):
A = np.zeros((nr-1,nr))
for x in range( matrix.shape[0]-nr+1):
newrow = (np.transpose( matrix[x:x+nr,2:3] ))
A = np.vstack([A , newrow])
return A
def vect_app1(matrix,nr):
col2 = matrix[:,2]
nrows = col2.size-nr+1
out = np.zeros((nr-1+nrows,nr))
col2_2D = np.take(col2,np.arange(nrows)[:,None] + np.arange(nr))
out[nr-1:] = col2_2D
return out
def vect_app2(matrix,nr):
col2 = matrix[:,2]
nrows = col2.size-nr+1
out = np.zeros((nr-1+nrows,nr))
n = col2.strides[0]
col2_2D = np.lib.stride_tricks.as_strided(col2, \
shape=(nrows,nr), strides=(n,n))
out[nr-1:] = col2_2D
return out
Timings and verification -
In [18]: # Setup input array and params
...: matrix = np.arange(1800).reshape((60, 30))
...: nr=3
...:
In [19]: np.allclose(org_app1(matrix,nr),vect_app1(matrix,nr))
Out[19]: True
In [20]: np.allclose(org_app1(matrix,nr),vect_app2(matrix,nr))
Out[20]: True
In [21]: %timeit org_app1(matrix,nr)
1000 loops, best of 3: 646 µs per loop
In [22]: %timeit vect_app1(matrix,nr)
10000 loops, best of 3: 20.6 µs per loop
In [23]: %timeit vect_app2(matrix,nr)
10000 loops, best of 3: 21.5 µs per loop
In [28]: # Setup input array and params
...: matrix = np.arange(7200).reshape((120, 60))
...: nr=30
...:
In [29]: %timeit org_app1(matrix,nr)
1000 loops, best of 3: 1.19 ms per loop
In [30]: %timeit vect_app1(matrix,nr)
10000 loops, best of 3: 45 µs per loop
In [31]: %timeit vect_app2(matrix,nr)
10000 loops, best of 3: 27.2 µs per loop