I have a dataframe with dates (30/09/2022 to 30/11/2022) and 15 stock prices (showing 5 here for reference) for each of these dates (excluding weekends).
Current Data:
DATES | A | B | C | D | E |
30/09/22 |100.5|151.3|233.4|237.2|38.42|
01/10/22 |101.5|148.0|237.6|232.2|38.54|
02/10/22 |102.2|147.6|238.3|231.4|39.32|
03/10/22 |103.4|145.7|239.2|232.2|39.54|
I wanted to get the Pearson correlation matrix, so I did this:
import numpy as np
import pandas as pd

df = pd.read_excel(file_path, sheet_name)
df = df.dropna()  # Remove dates that do not have prices for all stocks
log_df = df.set_index("DATES").pipe(lambda d: np.log(d.div(d.shift()))).reset_index()
corrM = log_df.corr()
Now I want to build the Pearson Uncentered Correlation Matrix, so I have the following function:
def uncentered_correlation(x, y):
    x_dim = len(x)
    y_dim = len(y)
    xy = 0
    xx = 0
    yy = 0
    for i in range(x_dim):
        xy = xy + x[i] * y[i]
        xx = xx + x[i] ** 2.0
        yy = yy + y[i] ** 2.0
    corr = xy / np.sqrt(xx * yy)
    return corr
However, I do not know how to apply this function to each possible pair of columns of the dataframe to get the correlation matrix.
Try this? It is not the most elegant approach, but it may work for you. :)
from itertools import product

def iter_product(a, b):
    return list(product(a, b))

df = 'your dataframe here'

re_dict = {}
iter_re = iter_product(df.columns, df.columns)
for i in iter_re:
    result = uncentered_correlation(df[f'{i[0]}'], df[f'{i[1]}'])
    re_dict[i] = result
re_df = pd.DataFrame(re_dict, index=[0]).stack()
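If a square matrix is preferred over the stacked result, the pairwise dictionary can be reshaped; a minimal sketch, assuming the re_dict built above:

import pandas as pd

# Reshape the {(col_a, col_b): value} pairs into a square matrix.
uncentered = pd.Series(re_dict)
uncentered.index = pd.MultiIndex.from_tuples(uncentered.index)
corr_matrix = uncentered.unstack()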
First compute a list of the possible column combinations; you can use the itertools library for that.
Then use pandas.DataFrame.apply() across multiple columns, as explained here.
Here is a simple code example:
import pandas as pd
import itertools

data = {'col1': [1, 3], 'col2': [2, 4], 'col3': [5, 6]}
df = pd.DataFrame(data)

def add(num1, num2):
    return num1 + num2

cols = list(df)
combList = list(itertools.combinations(cols, 2))

for tup in combList:
    firstCol = tup[0]
    secCol = tup[1]
    df[f'sum_{firstCol}_{secCol}'] = df.apply(lambda x: add(x[firstCol], x[secCol]), axis=1)
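The same combinations pattern can be adapted to the uncentered-correlation question; a rough sketch, where the names vals and corrU are just for illustration and log_df and uncentered_correlation come from the question:

import itertools
import numpy as np
import pandas as pd

# Build a square uncentered-correlation matrix from the pairwise function.
vals = log_df.set_index("DATES").dropna()   # log returns, with the first (NaN) row removed
cols = list(vals.columns)
corrU = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)  # diagonal is 1 by definition
for a, b in itertools.combinations(cols, 2):
    c = uncentered_correlation(vals[a].to_numpy(), vals[b].to_numpy())
    corrU.loc[a, b] = corrU.loc[b, a] = c

Alternatively, recent pandas versions accept a callable as the method argument of DataFrame.corr, so vals.corr(method=uncentered_correlation) should produce the same matrix without the explicit loop.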
I have written a function to find the frequency of itemsets of size k given candidate itemsets. The dataset contains more than 16,000 transactions. Can someone please help me optimize this function? In its current form it takes about 45 minutes to execute with minSupport=1.
Sample dataset
Algorithm 0 (See other algorithms below)
Implemented a boost of your algorithm using Numba. Numba is a JIT compiler that translates Python code into highly optimized machine code (via LLVM). For many algorithms Numba achieves a speed boost of 50-200x.
To use Numba you have to install it through pip install numba; note that Numba currently only supports Python <= 3.8, it has not been released for 3.9 yet!
I have rewritten your code a bit to satisfy Numba's compilation requirements; my code should behave identically to yours, but please do some tests.
My Numba-optimized code should give you a very good speedup!
I also created some short artificial example input data to run tests on.
Try it online!
import numba, numpy as np, pandas as pd

@numba.njit(cache = True)
def selectLkNm(dataSet, Ck, minSupport):
    dict_data = {}
    transactions = dataSet.shape[0]
    for items in Ck:
        count = 0
        while count < transactions:
            if items not in dict_data:
                dict_data[items] = 0
            for item in items:
                for e in dataSet[count, :]:
                    if item == e:
                        break
                else:
                    break
            else:
                dict_data[items] += 1
            count += 1
    Lk = {}
    for k, v in dict_data.items():
        if v >= minSupport:
            Lk[k] = v
    return Lk

def selectLk(dataSet, Ck, minSupport):
    tCk = numba.typed.List()
    for e in Ck:
        tCk.append(e)
    return selectLkNm(dataSet.values, tCk, minSupport)

dataset = pd.DataFrame([[100,160,100,160],[170,180,190,200],[100,160,190,200]])
C1 = set()
C1.add((100, 160))
C1.add((170, 180))
C1.add((190, 200))
Lk = selectLk(dataset, C1, 2)
print(Lk)
Output:
{(100, 160): 2, (190, 200): 2}
Algorithm 1 (See other algorithms below)
I improved Algorithm 0 (above) by sorting your data; this gives a good speedup if Ck contains many values or each tuple inside Ck is quite long.
Try it online!
import numba, numpy as np, pandas as pd

@numba.njit(cache = True)
def selectLkNm(dataSet, Ck, minSupport):
    assert dataSet.ndim == 2
    dataSet2 = np.empty_like(dataSet)
    for i in range(dataSet.shape[0]):
        dataSet2[i] = np.sort(dataSet[i])
    dataSet = dataSet2
    dict_data = {}
    transactions = dataSet.shape[0]
    for items in Ck:
        count = 0
        while count < transactions:
            if items not in dict_data:
                dict_data[items] = 0
            for item in items:
                ix = np.searchsorted(dataSet[count, :], item)
                if not (ix < dataSet.shape[1] and dataSet[count, ix] == item):
                    break
            else:
                dict_data[items] += 1
            count += 1
    Lk = {}
    for k, v in dict_data.items():
        if v >= minSupport:
            Lk[k] = v
    return Lk

def selectLk(dataSet, Ck, minSupport):
    tCk = numba.typed.List()
    for e in Ck:
        tCk.append(e)
    return selectLkNm(dataSet.values, tCk, minSupport)

dataset = pd.DataFrame([[100,160,100,160],[170,180,190,200],[100,160,190,200]])
C1 = set()
C1.add((100, 160))
C1.add((170, 180))
C1.add((190, 200))
Lk = selectLk(dataset, C1, 2)
print(Lk)
Output:
{(100, 160): 2, (190, 200): 2}
Algorithm 2 (See other algorithms below)
If you're not allowed to use Numba, then I suggest the following improvements to your algorithm. I pre-sort your dataset so that searching for each item takes O(log N) time instead of O(N), which is much faster.
I see that your code uses a pandas DataFrame, which means you have pandas installed, and if you have pandas then you definitely have NumPy as well, so I decided to use it.
Try it online!
import numpy as np, pandas as pd, collections

def selectLk(dataSet, Ck, minSupport):
    dataSet = np.sort(dataSet.values, axis = 1)
    dict_data = collections.defaultdict(int)
    transactions = dataSet.shape[0]
    for items in Ck:
        count = 0
        while count < transactions:
            for item in items:
                ix = np.searchsorted(dataSet[count, :], item)
                if not (ix < dataSet.shape[1] and dataSet[count, ix] == item):
                    break
            else:
                dict_data[items] += 1
            count += 1
    Lk = {k : v for k, v in dict_data.items() if v >= minSupport}
    return Lk

dataset = pd.DataFrame([[100,160,100,160],[170,180,190,200],[100,160,190,200]])
C1 = set()
C1.add((100, 160))
C1.add((170, 180))
C1.add((190, 200))
Lk = selectLk(dataset, C1, 2)
print(Lk)
Output:
{(100, 160): 2, (190, 200): 2}
Algorithm 3
I just had the idea that the sorting part of Algorithm 2 may not be the bottleneck; the while loop over the transactions is probably the bottleneck instead.
So to improve the situation I decided to implement and use a faster algorithm with a 2D searchsorted version (there is no built-in 2D version, so it had to be implemented separately), which doesn't have any long pure-Python loops; most of the time is spent in NumPy functions.
Please try whether this Algorithm 3 is faster for you; it should only be faster if the inner while loop, not the sorting, was the bottleneck.
Try it online!
import numpy as np, pandas as pd, collections

def selectLk(dataSet, Ck, minSupport):
    def searchsorted2d(a, bs):
        s = np.r_[0, (np.maximum(a.max(1) - a.min(1) + 1, bs.ravel().max(0)) + 1).cumsum()[:-1]]
        a_scaled = (a + s[:, None]).ravel()
        def sub(b):
            b_scaled = b + s
            return np.searchsorted(a_scaled, b_scaled) - np.arange(len(s)) * a.shape[1]
        return sub
    assert dataSet.values.ndim == 2, dataSet.values.ndim
    dataSet = np.sort(dataSet.values, axis = 1)
    dict_data = collections.defaultdict(int)
    transactions = dataSet.shape[0]
    Ck = np.array(list(Ck))
    assert Ck.ndim == 2, Ck.ndim
    ss = searchsorted2d(dataSet, Ck)
    for items in Ck:
        cnts = np.zeros((dataSet.shape[0],), dtype = np.int64)
        for item in items:
            bs = item.repeat(dataSet.shape[0])
            ixs = np.minimum(ss(bs), dataSet.shape[1] - 1)
            cnts[...] += (dataSet[(np.arange(dataSet.shape[0]), ixs)] == bs).astype(np.uint8)
        dict_data[tuple(items)] += int((cnts == len(items)).sum())
    return {k : v for k, v in dict_data.items() if v >= minSupport}

dataset = pd.DataFrame([[100,160,100,160],[170,180,190,200],[100,160,190,200]])
C1 = set()
C1.add((100, 160))
C1.add((170, 180))
C1.add((190, 200))
Lk = selectLk(dataset, C1, 2)
print(Lk)
Output:
{(100, 160): 2, (190, 200): 2}
I have changed the order of execution of your code. However, since I do not have access to your actual input data, it is difficult to check whether the optimized code produces the expected outputs and how much speedup you gain.
Algorithm 0
import pandas as pd
import numpy as np
from collections import defaultdict

def selectLk(dataSet, Ck, minSupport):
    dict_data = defaultdict(int)
    for _, row in dataSet.iterrows():
        for items in Ck:
            dict_data[items] += all(item in row.values for item in items)
    Lk = {k : v for k, v in dict_data.items() if v > minSupport}
    return Lk

if __name__ == '__main__':
    data = list(range(0, 1000, 10))
    df_data = {}
    for i in range(26):
        sample = np.random.choice(data, size=16000, replace=True)
        df_data[f"d{i}"] = sample
    dataset = pd.DataFrame(df_data)
    C1 = set()
    C1.add((100, 160))
    C1.add((170, 180))
    C1.add((190, 200))
    Lk1 = selectLk(dataset, C1, 1)
    dataset = pd.DataFrame([[100,160,100,160],[170,180,190,200],[100,160,190,200]])
    Lk2 = selectLk(dataset, C1, 1)
    print(Lk1)
    print(Lk2)
Algorithm 1
Algorithm 1 uses numpy.equal.outer, which creates a boolean mask of the elements matching each Ck tuple; .any() and .all() then reduce that mask to the rows containing every item.
def selectLk(dataSet, Ck, minSupport):
    dict_data = defaultdict(int)
    dataSet_np = dataSet.to_numpy(copy=False)
    for items in Ck:
        dict_data[items] = dataSet[np.equal.outer(dataSet_np, items).any(axis=1).all(axis=1)].shape[0]
    Lk = {k : v for k, v in dict_data.items() if v > minSupport}
    return Lk
Result:
{(190, 200): 811, (170, 180): 797, (100, 160): 798}
{(190, 200): 2, (100, 160): 2}
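Note that np.equal.outer materializes an (n_rows x n_cols x len(items)) boolean array for each candidate tuple. If that ever becomes a memory concern, here is a sketch of an equivalent variant that only builds one row mask per item at a time (same inputs as above; not benchmarked):

import numpy as np
from collections import defaultdict

def selectLk_per_item(dataSet, Ck, minSupport):
    # Count, for each candidate tuple, the rows that contain every one of its items,
    # using a single (n_rows,) boolean mask per item instead of a 3D array.
    arr = dataSet.to_numpy(copy=False)
    dict_data = defaultdict(int)
    for items in Ck:
        row_has_all = np.all([(arr == item).any(axis=1) for item in items], axis=0)
        dict_data[items] = int(row_has_all.sum())
    return {k: v for k, v in dict_data.items() if v > minSupport}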
I have over 500,000 rows in my dataframe and a number of similar 'for' loops, which are causing my code to take over an hour to complete its computation. Is there a more efficient way of writing the following 'for' loop so that things run a lot faster?
col_26 = []
col_27 = []
col_28 = []

for ind in df.index:
    if df['A_factor'][ind] > df['B_factor'][ind]:
        col_26.append('Yes')
        col_27.append('No')
        col_28.append(df['A_value'][ind])
    elif df['A_factor'][ind] < df['B_factor'][ind]:
        col_26.append('No')
        col_27.append('Yes')
        col_28.append(df['B_value'][ind])
    else:
        col_26.append('')
        col_27.append('')
        col_28.append(float('nan'))
You might want to look into the pandas iterrows() function or into using apply(); you can also look at this article: https://towardsdatascience.com/how-to-make-your-pandas-loop-71-803-times-faster-805030df4f06
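For reference, here is a rough sketch of the apply() route mentioned above, assuming the A_factor/B_factor/A_value/B_value columns from the question. Note that apply(axis=1) is usually only modestly faster than an explicit loop; the fully vectorized answers below are faster still.

import numpy as np
import pandas as pd

def classify(row):
    # Compare the two factors and pick the matching flags and value.
    if row['A_factor'] > row['B_factor']:
        return pd.Series({'col_26': 'Yes', 'col_27': 'No', 'col_28': row['A_value']})
    if row['A_factor'] < row['B_factor']:
        return pd.Series({'col_26': 'No', 'col_27': 'Yes', 'col_28': row['B_value']})
    return pd.Series({'col_26': '', 'col_27': '', 'col_28': np.nan})

df = df.join(df.apply(classify, axis=1))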
Try column operations:
import numpy as np
import pandas as pd

data = {'A_factor': [1, 2, 3, 4, 5],
        'A_value': [10, 20, 30, 40, 50],
        'B_factor': [2, 3, 1, 2, 6],
        'B_value': [11, 22, 33, 44, 55]}
df = pd.DataFrame(data)

df['col_26'] = ''
df['col_27'] = ''
df['col_28'] = np.nan
mask = df['A_factor'] > df['B_factor']
df.loc[mask, 'col_26'] = 'Yes'
df.loc[~mask, 'col_26'] = 'No'
df.loc[mask, 'col_28'] = df[mask]['A_value']
df.loc[~mask, 'col_27'] = 'Yes'
df.loc[mask, 'col_27'] = 'No'
df.loc[~mask, 'col_28'] = df[~mask]['B_value']
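The same three-way logic can also be written with np.select, one call per output column; a short sketch using the same example df:

import numpy as np

conditions = [df['A_factor'] > df['B_factor'], df['A_factor'] < df['B_factor']]
df['col_26'] = np.select(conditions, ['Yes', 'No'], default='')
df['col_27'] = np.select(conditions, ['No', 'Yes'], default='')
df['col_28'] = np.select(conditions, [df['A_value'], df['B_value']], default=np.nan)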
Appending to lists in Python has some overhead. Initializing the lists to their final size before the iteration can speed things up. For example,
import timeit

def f():
    x = []
    for ii in range(500000):
        x.append(str(ii))

def f2():
    x = [""] * 500000
    for ii in range(500000):
        x[ii] = str(ii)

timeit.timeit("f()", "from __main__ import f", number=10)
# Output: 1.6317970999989484
timeit.timeit("f2()", "from __main__ import f2", number=10)
# Output: 1.3037318000024243
Since you're already using pandas / numpy, there are ways to vectorize your operations so they don't need looping. For example:
a_factor = df["A_factor"].to_numpy()
b_factor = df["B_factor"].to_numpy()
col_26 = np.empty(a_factor.shape, dtype='U3') # U3 => string of size 3
col_27 = np.empty(a_factor.shape, dtype='U3')
col_28 = np.empty(a_factor.shape)
a_greater = a_factor > b_factor
b_greater = a_factor < b_factor
both_equal = a_factor == b_factor
col_26[a_greater] = 'Yes'
col_26[b_greater] = 'No'
col_27[a_greater] = 'Yes'
col_27[b_greater] = 'No'
col_28[a_greater] = a_factor[a_greater]
col_28[b_greater] = b_factor[b_greater]
col_28[both_equal] = np.nan
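If the goal is to end up with new dataframe columns (an assumption on my part), the arrays can be assigned back in one step:

df["col_26"], df["col_27"], df["col_28"] = col_26, col_27, col_28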
append makes Python grow the list's underlying memory as the list gets longer; calling append inside a for loop therefore triggers repeated reallocation and copying. It is better to tell Python up front how many items you need:
col_26 = [True] * 500000
col_27 = [False] * 500000
col_28 = [float('nan')] * 500000

for ind in df.index:
    if df['A_factor'][ind] > df['B_factor'][ind]:
        col_28[ind] = df['A_value'][ind]
    elif df['A_factor'][ind] < df['B_factor'][ind]:
        col_26[ind] = False
        col_27[ind] = True
        col_28[ind] = df['B_value'][ind]
    else:
        col_26[ind] = ''
        col_27[ind] = ''
I have two matrices. One is of size (CxK) and the other is of size (SxK) (where S, C, and K all have the potential to be very large). I want to combine these into an output matrix of size (CxS) using the cosine similarity function. When I run my code, it takes a very long time to produce an output, and I was wondering if there is any way to optimize what I currently have. [Note: the two input matrices are often very sparse.]
I was previously traversing each matrix using two for index,row loops, but I have since switched to the while loops, which improved my run time significantly.
A  # this is one of my input matrices (pandas dataframe)
B  # this is my second input matrix (pandas dataframe)
C = pd.DataFrame(columns = ['col_1', 'col_2', 'col_3'])

i = 0
k = 0
while i < 5:
    col_1 = A.iloc[i].get('label_A')
    while k < 5:
        col_2 = B.iloc[k].get('label_B')
        propensity = cosine_similarity([A.drop('label_A', axis=1).iloc[i]],
                                       [B.drop('label_B', axis=1).iloc[k]])
        d = {'col_1': [col_1], 'col_2': [col_2], 'col_3': [propensity[0][0]]}
        to_append = pd.DataFrame(data=d)
        C = C.append(to_append)
        k += 1
    k = 0
    i += 1
Right now I have the loops to run on only 5 items from each matrix, producing a 5x5 matrix, but I would obviously like this to work for very large inputs. This is the first time I have done anything like this so please let me know if any facet of code can be improved (data types used to hold matrices, how to traverse them, updating the output matrix, etc.).
Thank you in advance.
This can be done much more easily and way faster by passing the whole arrays to cosine_similarity after you move the labels to the index:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import time

c = 50
s = 50
k = 100

A = pd.DataFrame(np.random.rand(c, k))
B = pd.DataFrame(np.random.rand(s, k))
A['label_A'] = [f'A{i}' for i in range(c)]
B['label_B'] = [f'B{i}' for i in range(s)]
C = pd.DataFrame()

# your program
start = time.time()
i = 0
k = 0
while i < c:
    col_1 = A.iloc[i].get('label_A')
    while k < s:
        col_2 = B.iloc[k].get('label_B')
        propensity = cosine_similarity([A.drop('label_A', axis=1).iloc[i]],
                                       [B.drop('label_B', axis=1).iloc[k]])
        d = {'col_1': [col_1], 'col_2': [col_2], 'col_3': [propensity[0][0]]}
        to_append = pd.DataFrame(data=d)
        C = C.append(to_append)
        k += 1
    k = 0
    i += 1
print(f'elementwise: {time.time() - start:7.3f} s')

# my solution
start = time.time()
A = A.set_index('label_A')
B = B.set_index('label_B')
C1 = pd.DataFrame(cosine_similarity(A, B), index=A.index, columns=B.index).stack().rename('col_3')
C1.index.rename(['col_1', 'col_2'], inplace=True)
C1 = C1.reset_index()
print(f'whole array: {time.time() - start:7.3f} s')

# verification
assert (C[['col_1', 'col_2']].to_numpy() == C1[['col_1', 'col_2']].to_numpy()).all() \
    and np.allclose(C.col_3.to_numpy(), C1.col_3.to_numpy())
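Since the question mentions that the inputs are often very sparse, it may also help that sklearn's cosine_similarity accepts scipy sparse matrices directly; a minimal sketch, assuming A and B with their label columns already moved to the index as above:

from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

A_sparse = csr_matrix(A.to_numpy())  # numeric part of A as a sparse matrix
B_sparse = csr_matrix(B.to_numpy())
sim = cosine_similarity(A_sparse, B_sparse)  # dense (C x S) result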
I have a bunch of data (10M + records) that breaks down to an identifier, a location and a date. I want to find the number of times that any identifier moved from some locationA to some other locationB over the entire set of dates. Any identifier may not have a location for all possible dates. When an identifier does not have a location recorded, that should be treated as an actual 'unknown' location for that date.
Here is some reproducible fake data...
import numpy as np
import pandas as pd
import datetime
base = datetime.date.today()
num_days = 50
dates = np.array([base - datetime.timedelta(days=x) for x in range(num_days-1, -1, -1)])
ids = np.arange(50)
mi = pd.MultiIndex.from_product([ids, dates])
locations = np.array([chr(x) for x in 97 + np.random.randint(26, size=len(mi))])
s = pd.Series(locations, index=mi)
mask = np.random.rand(len(mi)) > .5
s[mask] = np.nan
s = s.dropna()
My initial thought was to create a dataframe and use boolean masking/vectorized operations to solve this
df = s.unstack(0).fillna('unknown')
Apparently my data is sparse enough to cause a MemoryError (from all the extra entries resulting from unstacking).
My current working solution is the following
def series_fn(s):
    s = s.reindex(pd.date_range(s.index.levels[1].min(),
                                s.index.levels[1].max()), level=-1).fillna('unknown')
    mask_prev = (s != s.shift(-1))[:-1]
    mask_next = (s != s.shift())[1:]
    s_prev = s[:-1][mask_prev]
    s_next = s[1:][mask_next]
    s_tup = pd.Series(list(zip(s_prev, s_next)))
    return s_tup.value_counts()
result_per_id = s.groupby(level=0).apply(series_fn)
result = result_per_id.sum(level=-1)
result looks like
(a, b) 1
(a, c) 5
(a, e) 3
(a, f) 3
(a, g) 3
(a, h) 3
(a, i) 1
(a, j) 1
(a, k) 2
(a, l) 2
...
This is going to take ~5 hours for all my data. Does anyone know any faster ways of doing this?
Thanks!
Hmmm, I guess I should have transposed the data... well, that was a relatively simple fix. Instead of using groupby and apply, I now loop over consecutive dates:
import time

s.index.names = ['id', 'date']  # name the levels created in the setup above
s = s.reorder_levels(['date', 'id'])
s = s.sort_index(level=0)

results = []
for i in range(len(s.index.levels[0]) - 1):
    t = time.time()
    s0 = s.loc[s.index.levels[0][i]]
    s1 = s.loc[s.index.levels[0][i + 1]]
    df = pd.concat((s0, s1), axis=1)
    # Note: this is slower than the line above
    # df = s.loc[s.index.levels[0][0:2], :].unstack(0)
    df = df.fillna('unknown')
    mi = pd.MultiIndex.from_arrays((df.iloc[:, 0], df.iloc[:, 1]))
    s2 = pd.Series(1, mi)
    res = s2.groupby(level=[0, 1]).apply(np.sum)
    results.append(res)
    print(time.time() - t)
results = pd.concat(results, axis=1)
Still unclear on why the commented out section takes about three times as long as the three lines above it.
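For what it's worth, the per-pair counting can also be done with pd.crosstab instead of building a MultiIndex and grouping; a sketch along the same lines (not benchmarked, assuming s is ordered by ('date', 'id') as above):

counts = None
date_level = s.index.levels[0]
for d0, d1 in zip(date_level[:-1], date_level[1:]):
    # Align each day's locations with the next day's by id, filling gaps with 'unknown'.
    pair = pd.concat((s.loc[d0], s.loc[d1]), axis=1).fillna('unknown')
    ct = pd.crosstab(pair.iloc[:, 0], pair.iloc[:, 1])  # from-location x to-location counts
    counts = ct if counts is None else counts.add(ct, fill_value=0)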