I have a pandas dataframe containing columns A, B, C, D (each either 0 or 1) and columns AB, AC, AD, BC, BD, CD that contain their pairwise interactions (also either 0 or 1).
Based on the interactions, I want to establish the existence of "triplets" ABC, ABD, ACD, BCD as in the following MWE:
import numpy as np
import pandas as pd
df = pd.DataFrame()
np.random.seed(1)
df["A"] = np.random.randint(2, size=10)
df["B"] = np.random.randint(2, size=10)
df["C"] = np.random.randint(2, size=10)
df["D"] = np.random.randint(2, size=10)
df["AB"] = np.random.randint(2, size=10)
df["AC"] = np.random.randint(2, size=10)
df["AD"] = np.random.randint(2, size=10)
df["BC"] = np.random.randint(2, size=10)
df["BD"] = np.random.randint(2, size=10)
df["CD"] = np.random.randint(2, size=10)
ls = ["A", "B", "C", "D"]
for i, a in enumerate(ls):
    for j in range(i + 1, len(ls)):
        b = ls[j]
        for k in range(j + 1, len(ls)):
            c = ls[k]
            # keep only the rows where all three individual columns are active
            idx_abc = (df[a] > 0) & (df[b] > 0) & (df[c] > 0)
            # count how many of the three interactions are active on those rows
            sum_abc = df[idx_abc][a + b] + df[idx_abc][b + c] + df[idx_abc][a + c]
            df[a + b + c] = 0
            df.loc[sum_abc.index[sum_abc >= 2], a + b + c] = 999
This gives the following output:
A B C D AB AC AD BC BD CD ABC ABD ACD BCD
0 1 0 0 0 1 0 0 1 1 0 0 0 0 0
1 1 1 1 0 1 1 1 1 0 0 999 0 0 0
2 0 0 0 1 1 0 1 0 0 1 0 0 0 0
3 0 1 0 1 1 0 0 0 1 1 0 0 0 0
4 1 1 1 1 1 1 1 0 1 1 999 999 999 999
5 1 0 0 1 1 1 1 0 0 0 0 0 0 0
6 1 0 0 1 0 1 1 1 1 1 0 0 0 0
7 1 1 0 0 1 0 1 1 1 1 0 0 0 0
8 1 0 1 0 1 1 0 1 0 0 0 0 0 0
9 0 0 0 0 0 0 0 0 1 1 0 0 0 0
The logic behind the code is the following: a triplet ABC is active if the individual columns A, B, C are all active (=1) and at least two of the interaction columns AB, AC, BC are active (=1). (The code marks an active triplet with 999 rather than 1.)
I always start with the individual columns (for ABC these are A, B and C) and "keep" only the rows where A, B and C are all non-zero. Then, looking at the interactions AB, AC and BC, I "enable" the triplet ABC only if at least two of them are 1 - which is the case only for rows 1 and 4. Hence ABC = 999 for rows 1 and 4 and 0 for all others. I do this for all possible triplets (4 in this case).
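For concreteness, the same rule for a single triplet can be written as one vectorized expression (a sketch equivalent to one pass of the loop above, reusing the df from the MWE; the column name ABC_check is mine):
singles_on = (df["A"] > 0) & (df["B"] > 0) & (df["C"] > 0)  # A, B, C all active
pairs_on = (df["AB"] + df["AC"] + df["BC"]) >= 2            # at least two interactions active
df["ABC_check"] = np.where(singles_on & pairs_on, 999, 0)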
The above code runs fast as the dataframe is small. However, in my real code the dataframe has more than one million rows and hundreds of interactions, in which case it runs extremely slow.
Is there a way to optimize the above code, e.g. by multithreading it?
Here is a method that is about 10x faster than your reference code. It does nothing particularly clever, just pedestrian optimization.
import numpy as np
import pandas as pd
df = pd.DataFrame()
np.random.seed(1)
df["A"] = np.random.randint(2, size=10)
df["B"] = np.random.randint(2, size=10)
df["C"] = np.random.randint(2, size=10)
df["D"] = np.random.randint(2, size=10)
df["AB"] = np.random.randint(2, size=10)
df["AC"] = np.random.randint(2, size=10)
df["AD"] = np.random.randint(2, size=10)
df["BC"] = np.random.randint(2, size=10)
df["BD"] = np.random.randint(2, size=10)
df["CD"] = np.random.randint(2, size=10)
ls = ["A", "B", "C", "D"]
def op():
    out = df.copy()
    for i, a in enumerate(ls):
        for j in range(i + 1, len(ls)):
            b = ls[j]
            for k in range(j + 1, len(ls)):
                c = ls[k]
                idx_abc = (out[a] > 0) & (out[b] > 0) & (out[c] > 0)
                sum_abc = out[idx_abc][a + b] + out[idx_abc][b + c] + out[idx_abc][a + c]
                out[a + b + c] = 0
                # marker value; the question used 999, any nonzero sentinel works
                out.loc[sum_abc.index[sum_abc >= 2], a + b + c] = 99
    return out
import scipy.spatial.distance as ssd
from itertools import combinations

def pp():
    data = df.values
    n = len(ls)
    # split into singles (first n columns) and pairs (remaining n*(n-1)/2 columns)
    d1, d2 = np.split(data, [n], axis=1)
    # zero out each pair column wherever one of its two singles is inactive
    i, j = np.triu_indices(n, 1)
    d2 = d2 & d1[:, i] & d1[:, j]
    # enumerate all index triples k < i < j
    k, i, j = np.ogrid[:n, :n, :n]
    k, i, j = np.where((k < i) & (i < j))
    # lookup table mapping a pair (i, j) to its column position in d2
    lu = ssd.squareform(np.arange(n * (n - 1) // 2))
    # a triplet is active if at least two of its three (masked) pairs are active
    d3 = ((d2[:, lu[k, i]] + d2[:, lu[i, j]] + d2[:, lu[k, j]]) >= 2).view(np.uint8) * 99
    triplets = list(map("".join, combinations(ls, 3)))
    out = df.copy()
    out[triplets] = pd.DataFrame(d3, columns=triplets)
    return out
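To see what the squareform lookup table contains, here is a small illustration for n = 4 (my own example, not part of the answer):
import numpy as np
import scipy.spatial.distance as ssd
lu = ssd.squareform(np.arange(4 * 3 // 2))
print(lu)
# [[0 0 1 2]
#  [0 0 3 4]
#  [1 3 0 5]
#  [2 4 5 0]]
# lu[i, j] is the position of pair (i, j) in the condensed, lexicographically
# ordered pair list (AB, AC, AD, BC, BD, CD); e.g. lu[0, 2] == 1 points at "AC".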
from string import ascii_uppercase
from itertools import combinations, chain

def make(nl=8, nr=1000000, seed=1):
    np.random.seed(seed)
    letters = np.fromiter(ascii_uppercase, 'U1', nl)
    df = pd.DataFrame()
    # singles first, then all pairs in lexicographic order
    for l in chain(letters, map("".join, combinations(letters, 2))):
        df[l] = np.random.randint(0, 2, nr, dtype=np.uint8)
    return letters, df
df1 = op()
df2 = pp()
assert (df1==df2).all().all()
ls, df = make(8,1000)
df1 = op()
df2 = pp()
assert (df1==df2).all().all()
from timeit import timeit
print(timeit(op,number=10))
print(timeit(pp,number=10))
ls, df = make(26,250000)
import time
t0 = time.perf_counter()
df2 = pp()
t1 = time.perf_counter()
print(t1-t0)
Sample run:
3.2022583668585867 # op 8 symbols, 1000 rows, 10 repeats
0.2772211490664631 # pp 8 symbols, 1000 rows, 10 repeats
12.412292044842616 # pp 26 symbols, 250,000 rows, single run
Related
I am trying to create two new columns for a dataframe, based on a price column: the number of times the price has increased or decreased consecutively. This needs to be fast because the dataframes can be quite big.
For example, the result should look like:
input = [1,2,3,2,1]
increase = [0,1,2,0,0]
decrease = [0,0,0,1,2]
You can compute the diff and apply a cumsum on the positive/negative values:
df = pd.DataFrame({'col': [1,2,3,2,1]})
s = df['col'].diff()
df['increase'] = s.gt(0).cumsum().where(s.gt(0), 0)
df['decrease'] = s.lt(0).cumsum().where(s.lt(0), 0)
Output:
col increase decrease
0 1 0 0
1 2 1 0
2 3 2 0
3 2 0 1
4 1 0 2
resetting the count
As I realize your example is ambiguous, here is an additional method using groupby, in case you want to reset the counts for each increasing/decreasing group.
The resetting counts are labeled inc2/dec2:
df = pd.DataFrame({'col': [1,2,3,2,1,2,3,1]})
s = df['col'].diff()
s1 = s.gt(0)
s2 = s.lt(0)
df['inc'] = s1.cumsum().where(s1, 0)
df['dec'] = s2.cumsum().where(s2, 0)
si = df['inc'].eq(0)
sd = df['dec'].eq(0)
df['inc2'] = si.groupby(si.cumsum()).cumcount()
df['dec2'] = sd.groupby(sd.cumsum()).cumcount()
Output:
col inc dec inc2 dec2
0 1 0 0 0 0
1 2 1 0 1 0
2 3 2 0 2 0
3 2 0 1 0 1
4 1 0 2 0 2
5 2 3 0 1 0
6 3 4 0 2 0
7 1 0 3 0 1
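Why the inc2/dec2 lines work (my own illustration, not from the original answer): si marks the positions where the count resets, so si.cumsum() gives every run of consecutive non-resets a distinct group id, and cumcount() then numbers the rows within each group starting from 0:
si = pd.Series([True, False, False, True, True, False, False, True])
print(si.cumsum().tolist())                         # [1, 1, 1, 2, 3, 3, 3, 4]
print(si.groupby(si.cumsum()).cumcount().tolist())  # [0, 1, 2, 0, 0, 1, 2, 0]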
data = {
    'input': [1, 2, 3, 2, 1]
}
df = pd.DataFrame(data)
inc = df['input'] > df['input'].shift()
dec = df['input'] < df['input'].shift()
# running count of increases, minus that count frozen at the last non-increase:
# the difference is the length of the current increasing run (same for decreases)
df['a'] = (inc.cumsum() - inc.cumsum().where(~inc).ffill().fillna(0)).astype(int)
df['b'] = (dec.cumsum() - dec.cumsum().where(~dec).ffill().fillna(0)).astype(int)
print(df)
output
input a b
0 1 0 0
1 2 1 0
2 3 2 0
3 2 0 1
4 1 0 2
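To make the freeze-and-subtract pattern concrete, here is each intermediate for the sample input (my own trace of the code above, not part of the original answer):
s = pd.Series([1, 2, 3, 2, 1])
m = s > s.shift()                            # [False, True, True, False, False]
total = m.cumsum()                           # [0, 1, 2, 2, 2]
frozen = total.where(~m).ffill().fillna(0)   # [0, 0, 0, 2, 2]
print((total - frozen).astype(int).tolist()) # [0, 1, 2, 0, 0]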
Coding this manually using numpy might look like this:
import numpy as np
input = np.array([1, 2, 3, 2, 1])
increase = np.zeros(len(input), dtype=int)
decrease = np.zeros(len(input), dtype=int)
for i in range(1, len(input)):
    if input[i] > input[i-1]:
        increase[i] = increase[i-1] + 1
        decrease[i] = 0
    elif input[i] < input[i-1]:
        increase[i] = 0
        decrease[i] = decrease[i-1] + 1
    else:
        increase[i] = 0
        decrease[i] = 0
increase  # array([0, 1, 2, 0, 0])
decrease  # array([0, 0, 0, 1, 2])
I have the following df:
df = pd.DataFrame({"T": [1, 2, 3, 5, 1, 3, 2, 5],
                   "A": [0, 0, 1, 1, 0, 1, 0, 1],
                   "B": [0, 0, 0, 1, 0, 0, 0, 1]})
df
   T  A  B
0  1  0  0
1  2  0  0
2  3  1  0
3  5  1  1
4  1  0  0
5  3  1  0
6  2  0  0
7  5  1  1
Columns A & B were created as follows:
df['A'] = [int(x) for x in df["T"] >= 3]
df['B'] = [int(x) for x in df["T"] >= 5]
I have a train/test split:
train_size = 0.6
training = df.head(int(train_size * df.shape[0]))
test = df.tail(int((1 - train_size) * df.shape[0]))
Here is the question:
How can I pass the row values of "T" from 'training' into a list called 'tr_returns', for the rows where both columns "A" & "B" are == 1?
I tried this:
tr_returns = training[training['A' and 'B'] == 1]['T'] or
tr_returns = training[training['A'] == 1 and training['B'] == 1]['T']
But neither one works :( Any help will be appreciated!
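For reference, the standard fix (my addition, not from the original thread): element-wise boolean masks must be combined with & rather than Python's and, and each comparison needs its own parentheses:
mask = (training['A'] == 1) & (training['B'] == 1)
tr_returns = training.loc[mask, 'T'].tolist()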
I have the following data frame:
A = [1,2,5,4,3,1]
B = ["yes","No","hello","yes","no", 'why']
C = [1,0,1,1,0,0]
D = ['y','n','y','y','n','n']
test_df = pd.DataFrame({'A': A, 'B': B, 'C': C, 'D':D})
We can see 4 columns A, B, C, D. The intended outcome is to replace the contents of B with the contents of D wherever a condition on C is met; for this example the condition is C == 1.
the intended output is
A = [1,2,5,4,3,1]
B = ["y","No","y","y","no", 'why']
C = [1,0,1,1,0,0]
D = ['y','n','y','y','n','n']
output_df = pd.DataFrame({'A': A, 'B': B, 'C': C, 'D':D})
output_df = output_df.drop('D', axis=1)
What is the best way to apply this logic to a data frame?
There are many ways to solve this; here is one of them:
test_df['B'] = test_df['B'].mask(test_df['C'] == 1, test_df['D'])
Output:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
This can be done with np.where:
test_df['B'] = np.where(test_df['C']==1, test_df['D'], test_df['B'])
Output:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
The desired output is achieved using .loc with column 'C' as the mask.
test_df.loc[test_df['C']==1,'B'] = test_df.loc[test_df['C']==1,'D']
UPDATE: Just found out a similar answer was posted by @QuangHoang. This answer is slightly different in that it does not require numpy.
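A small variant (my addition): computing the mask once avoids evaluating the condition twice:
m = test_df['C'] == 1
test_df.loc[m, 'B'] = test_df.loc[m, 'D']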
I don't know if inverse is the right word here, but I noticed recently that mask and where are "inverses" of each other. If you pass a ~ to the condition of a .where statement, then you get the same result as mask:
A = [1,2,5,4,3,1]
B = ["yes","No","hello","yes","no", 'why']
C = [1,0,1,1,0,0]
D = ['y','n','y','y','n','n']
test_df = pd.DataFrame({'A': A, 'B': B, 'C': C, 'D':D})
test_df['B'] = test_df['B'].where(~(test_df['C'] == 1), test_df['D'])
# test_df['B'] = test_df['B'].mask(test_df['C'] == 1, test_df['D']) - Scott Boston's answer
test_df
Out[1]:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
You can also use df.where:
test_df['B'] = test_df['D'].where(test_df.C.eq(1), test_df.B)
Output:
In [875]: test_df
Out[875]:
A B C D
0 1 y 1 y
1 2 No 0 n
2 5 y 1 y
3 4 y 1 y
4 3 no 0 n
5 1 why 0 n
Given the following table
vals
0 20
1 3
2 2
3 10
4 20
I'm trying to find a clean solution in pandas to subtract away a value, say 30, consuming it row by row from the top, to end with the following result.
vals
0 0
1 0
2 0
3 5
4 20
I was wondering if pandas had a solution to performing this that didn't require looping through all the rows in a dataframe, something that takes advantage of pandas's bulk operations.
identify where the cumulative sum is greater than or equal to 30
mask the rows where it isn't
reassign the single crossing row to be its cumsum less 30
c = df.vals.cumsum()
m = c.ge(30)
i = m.idxmax()
n = df.vals.where(m, 0)
n.loc[i] = c.loc[i] - 30
df.assign(vals=n)
vals
0 0
1 0
2 0
3 5
4 20
Same thing, but numpyfied
v = df.vals.values
c = v.cumsum()
m = c >= 30
i = m.argmax()
n = np.where(m, v, 0)
n[i] = c[i] - 30
df.assign(vals=n)
vals
0 0
1 0
2 0
3 5
4 20
Timing
%%timeit
v = df.vals.values
c = v.cumsum()
m = c >= 30
i = m.argmax()
n = np.where(m, v, 0)
n[i] = c[i] - 30
df.assign(vals=n)
10000 loops, best of 3: 168 µs per loop
%%timeit
c = df.vals.cumsum()
m = c.ge(30)
i = m.idxmax()
n = df.vals.where(m, 0)
n.loc[i] = c.loc[i] - 30
df.assign(vals=n)
1000 loops, best of 3: 853 µs per loop
Here's one using NumPy with four lines of code -
v = df.vals.values             # NumPy view into the column: edits happen in place
a = v.cumsum() - 30            # running total less the amount to subtract
idx = (a > 0).argmax() + 1     # one past the first row where the running total exceeds 30
v[:idx] = a.clip(min=0)[:idx]  # zero out fully consumed rows, keep the remainder
Sample run -
In [274]: df # Original df
Out[274]:
vals
0 20
1 3
2 2
3 10
4 20
In [275]: df.iloc[3,0] = 7 # Bringing in some variety
In [276]: df
Out[276]:
vals
0 20
1 3
2 2
3 7
4 20
In [277]: v = df.vals.values
...: a = v.cumsum()-30
...: idx = (a>0).argmax()+1
...: v[:idx] = a.clip(min=0)[:idx]
...:
In [278]: df
Out[278]:
vals
0 0
1 0
2 0
3 2
4 20
# A one-liner solution
df['vals'] = df.assign(res=30 - df.vals.cumsum()).apply(
    lambda x: 0 if x.res > 0 else x.vals if abs(x.res) > x.vals else x.vals - abs(x.res),
    axis=1)
df
Out[96]:
vals
0 0
1 0
2 0
3 5
4 20
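For completeness (my addition, not from the thread): the same logic can be written fully vectorized, avoiding apply:
df = pd.DataFrame({'vals': [20, 3, 2, 10, 20]})
c = df['vals'].cumsum() - 30
df['vals'] = np.minimum(df['vals'], c.clip(lower=0))
# vals -> [0, 0, 0, 5, 20]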
I'm trying to multiply two pandas dataframes with each other. Specifically, I want to multiply every column with every column of the other df.
The dataframes are one-hot encoded, so they look like this:
col_1, col_2, col_3, ...
0 1 0
1 0 0
0 0 1
...
I could just iterate through each of the columns using a for loop, but in python that is computationally expensive, and I'm hoping there's an easier way.
One of the dataframes has 500 columns, the other has 100 columns.
This is the fastest version that I've been able to write so far:
interact_pd = pd.DataFrame(index=df_1.index)
df1_columns = [column for column in df_1]
for column in df_2:
    col_pd = df_1[df1_columns].multiply(df_2[column], axis="index")
    interact_pd = interact_pd.join(col_pd, lsuffix='_' + column)
I iterate over each column in df_2 and multiply all of df_1 by that column, then I append the result to interact_pd. I would rather not do it using a for loop however, as this is very computationally costly. Is there a faster way of doing it?
EDIT: example
df_1:
1col_1, 1col_2, 1col_3
0 1 0
1 0 0
0 0 1
df_2:
2col_1, 2col_2
0 1
1 0
0 0
interact_pd:
1col_1_2col_1, 1col_2_2col_1,1col_3_2col_1, 1col_1_2col_2, 1col_2_2col_2,1col_3_2col_2
0 0 0 0 1 0
1 0 0 0 0 0
0 0 0 0 0 0
# use numpy to get a pair of indices that map out every
# combination of columns from df_1 and columns of df_2
pidx = np.indices((df_1.shape[1], df_2.shape[1])).reshape(2, -1)
# use pandas MultiIndex to create a nice MultiIndex for
# the final output
lcol = pd.MultiIndex.from_product([df_1.columns, df_2.columns],
                                  names=[df_1.columns.name, df_2.columns.name])
# df_1.values[:, pidx[0]] slices df_1 values for every combination,
# likewise with df_2.values[:, pidx[1]];
# finally, I marry up the product of the arrays with the MultiIndex
pd.DataFrame(df_1.values[:, pidx[0]] * df_2.values[:, pidx[1]],
             columns=lcol)
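If flat column names in the OP's 1col_1_2col_1 style are preferred over the MultiIndex, they can be joined afterwards (my addition, not part of the original answer):
out = pd.DataFrame(df_1.values[:, pidx[0]] * df_2.values[:, pidx[1]], columns=lcol)
out.columns = ["_".join(pair) for pair in out.columns]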
Timing
code
from string import ascii_letters
df_1 = pd.DataFrame(np.random.randint(0, 2, (1000, 26)), columns=list(ascii_letters[:26]))
df_2 = pd.DataFrame(np.random.randint(0, 2, (1000, 52)), columns=list(ascii_letters))
def pir1(df_1, df_2):
    pidx = np.indices((df_1.shape[1], df_2.shape[1])).reshape(2, -1)
    lcol = pd.MultiIndex.from_product([df_1.columns, df_2.columns],
                                      names=[df_1.columns.name, df_2.columns.name])
    return pd.DataFrame(df_1.values[:, pidx[0]] * df_2.values[:, pidx[1]],
                        columns=lcol)
def Test2(DA, DB):
    MA = DA.to_numpy()   # was DA.as_matrix(), which was removed in pandas 1.0
    MB = DB.to_numpy()
    MM = np.zeros((len(MA), len(MA[0]) * len(MB[0])))
    Col = []
    for i in range(len(MB[0])):
        for j in range(len(MA[0])):
            MM[:, i * len(MA[0]) + j] = MA[:, j] * MB[:, i]
            Col.append('1col_' + str(i + 1) + '_2col_' + str(j + 1))
    return pd.DataFrame(MM, dtype=int, columns=Col)
results: (the timing plot from the original answer is not reproduced here)
You can multiply your first df along the index axis with each column of the second df; this is the fastest method for big datasets (see the benchmark below):
df = pd.concat([df_1.mul(col[1], axis="index") for col in df_2.items()], axis=1)  # .iteritems() before pandas 2.0
# Change the name of the columns
df.columns = ["_".join([i, j]) for j in df_2.columns for i in df_1.columns]
df
1col_1_2col_1 1col_2_2col_1 1col_3_2col_1 1col_1_2col_2 \
0 0 0 0 0
1 1 0 0 0
2 0 0 0 0
1col_2_2col_2 1col_3_2col_2
0 1 0
1 0 0
2 0 0
--> See benchmark for comparisons with other answers to choose the best option for your dataset.
Benchmark
Functions:
def Test2(DA, DB):
    MA = DA.to_numpy()
    MB = DB.to_numpy()
    MM = np.zeros((len(MA), len(MA[0]) * len(MB[0])))
    Col = []
    for i in range(len(MB[0])):
        for j in range(len(MA[0])):
            MM[:, i * len(MA[0]) + j] = MA[:, j] * MB[:, i]
            Col.append('1col_' + str(i + 1) + '_2col_' + str(j + 1))
    return pd.DataFrame(MM, dtype=int, columns=Col)

def Test3(df_1, df_2):
    df = pd.concat([df_1.mul(i[1], axis="index") for i in df_2.items()], axis=1)
    df.columns = ["_".join([i, j]) for j in df_2.columns for i in df_1.columns]
    return df

def Test4(df_1, df_2):
    pidx = np.indices((df_1.shape[1], df_2.shape[1])).reshape(2, -1)
    lcol = pd.MultiIndex.from_product([df_1.columns, df_2.columns],
                                      names=[df_1.columns.name, df_2.columns.name])
    return pd.DataFrame(df_1.values[:, pidx[0]] * df_2.values[:, pidx[1]],
                        columns=lcol)

def jeanrjc_imp(df_1, df_2):
    df = pd.concat([df_1.mul(i[1], axis="index") for i in df_2.items()],
                   axis=1, keys=df_2.columns)
    return df
Code:
Sorry for the ugly code; the plot at the end is what matters:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
df_1 = pd.DataFrame(np.random.randint(0, 2, (1000, 600)))
df_2 = pd.DataFrame(np.random.randint(0, 2, (1000, 600)))
df_1.columns = ["1col_" + str(i) for i in range(len(df_1.columns))]
df_2.columns = ["2col_" + str(i) for i in range(len(df_2.columns))]
resa = {}
resb = {}
resc = {}
for f, r in zip([Test2, Test3, Test4, jeanrjc_imp], ["T2", "T3", "T4", "T3bis"]):
    resa[r] = []
    resb[r] = []
    resc[r] = []
    for i in [5, 10, 30, 50, 150, 200]:
        a = %timeit -o f(df_1.iloc[:, :i], df_2.iloc[:, :10])
        b = %timeit -o f(df_1.iloc[:, :i], df_2.iloc[:, :50])
        c = %timeit -o f(df_1.iloc[:, :i], df_2.iloc[:, :200])
        resa[r].append(a.best)
        resb[r].append(b.best)
        resc[r].append(c.best)
X = [5, 10, 30, 50, 150, 200]
fig, ax = plt.subplots(1, 3, figsize=[16, 5])
for j, (a, r) in enumerate(zip(ax, [resa, resb, resc])):
    for i in r:
        a.plot(X, r[i], label=i)
    a.set_xlabel("df_1 columns #")
    a.set_title("df_2 columns # = {}".format(["10", "50", "200"][j]))
ax[0].set_ylabel("time(s)")
plt.legend(loc=0)
plt.tight_layout()
(The resulting timing plots are not reproduced here.) In them, T3bis corresponds to jeanrjc_imp, which is a bit faster than Test3.
Conclusion:
Depending on your dataset size, pick the right function between Test4 and Test3(bis). Given the OP's dataset, Test3 or jeanrjc_imp should be the fastest, and also the shortest to write!
HTH
You can use numpy.
Consider this example code; I modified the variable names, but Test1() is essentially your code. I didn't bother to create the correct column names in that function, though:
import pandas as pd
import numpy as np
A = [[1,0,1,1],[0,1,1,0],[0,1,0,1]]
B = [[0,0,1,0],[1,0,1,0],[1,1,0,0],[1,0,0,1],[1,0,0,0]]
DA = pd.DataFrame(A).T
DB = pd.DataFrame(B).T
def Test1(DA, DB):
    E = pd.DataFrame(index=DA.index)
    DAC = [column for column in DA]
    for column in DB:
        C = DA[DAC].multiply(DB[column], axis="index")
        E = E.join(C, lsuffix='_' + str(column))
    return E
def Test2(DA, DB):
    MA = DA.to_numpy()   # was .as_matrix(), which was removed in pandas 1.0
    MB = DB.to_numpy()
    MM = np.zeros((len(MA), len(MA[0]) * len(MB[0])))
    Col = []
    for i in range(len(MB[0])):
        for j in range(len(MA[0])):
            MM[:, i * len(MA[0]) + j] = MA[:, j] * MB[:, i]
            Col.append('1col_' + str(i + 1) + '_2col_' + str(j + 1))
    return pd.DataFrame(MM, dtype=int, columns=Col)
print(Test1(DA, DB))
print(Test2(DA, DB))
Output:
0_1 1_1 2_1 0 1 2 0_3 1_3 2_3 0 1 2 0 1 2
0 0 0 0 1 0 0 1 0 0 1 0 0 1 0 0
1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0
2 1 1 0 1 1 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0
1col_1_2col_1 1col_1_2col_2 1col_1_2col_3 1col_2_2col_1 1col_2_2col_2 \
0 0 0 0 1 0
1 0 0 0 0 0
2 1 1 0 1 1
3 0 0 0 0 0
1col_2_2col_3 1col_3_2col_1 1col_3_2col_2 1col_3_2col_3 1col_4_2col_1 \
0 0 1 0 0 1
1 0 0 1 1 0
2 0 0 0 0 0
3 0 0 0 0 1
1col_4_2col_2 1col_4_2col_3 1col_5_2col_1 1col_5_2col_2 1col_5_2col_3
0 0 0 1 0 0
1 0 0 0 0 0
2 0 0 0 0 0
3 0 1 0 0 0
Performance of your function:
%timeit(Test1(DA,DB))
100 loops, best of 3: 11.1 ms per loop
Performance of my function:
%timeit(Test2(DA,DB))
1000 loops, best of 3: 464 µs per loop
It's not beautiful, but it's efficient.
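As a footnote (my addition, not from the answer): the same column-by-column outer product can be written without the Python loop using NumPy broadcasting, which produces the columns in the same i*len(MA[0])+j order as Test2:
MM = (MB[:, :, None] * MA[:, None, :]).reshape(len(MA), -1)
assert np.array_equal(MM, Test2(DA, DB).to_numpy())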