I have a 300,000-row pd.DataFrame with multiple columns, one of which holds a 50-dimensional numpy array of shape (1, 50) in each row, like so:
ID Array1
1 [2.4252 ... 5.6363]
2 [3.1242 ... 9.0091]
3 [6.6775 ... 12.958]
...
300000 [0.1260 ... 5.3323]
I then generate a new numpy array (let's call it array2) with the same shape and calculate the cosine similarity between each row of the dataframe and the generated array. For this, I am currently using sklearn.metrics.pairwise.cosine_similarity and saving the results in a new column:
from sklearn.metrics.pairwise import cosine_similarity
df['Cosine'] = cosine_similarity(df['Array1'].tolist(), array2)
This works as intended and takes, on average, 2.5 seconds to execute. I am trying to get that under 1 second, simply to reduce waiting time in the system I am building.
I am beginning to learn about Vaex and Dask as alternatives to pandas but am failing to convert the code I provided to a working equivalent that is also faster.
Preferably with one of the technologies I mentioned, how can I go about making pairwise cosine calculations even faster for large datasets?
You could use Faiss here and run a knn search. To do this, you would put the dataframe into a Faiss index and then search it with the query array, using k=300000 (or whatever the total number of rows of your dataframe is).
import faiss
import numpy as np

dimension = 50   # each Array1 vector is 50-dimensional
k = len(df)      # ask for a score against every row

# stack the dataframe column into one contiguous float32 matrix
vectors = np.vstack(df['Array1'].values).astype('float32')

# build an inner-product index and add all rows in one call
index = faiss.IndexFlatIP(dimension)
index.add(vectors)

# search with the generated query vector
query = array2.astype('float32').reshape(1, dimension)
D, I = index.search(query, k)
Note that you'll need to normalise the vectors to make this work (as the above solution is based on inner product).
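A minimal sketch of that normalisation step, assuming the vectors and query arrays from the snippet above; faiss.normalize_L2 rescales each row to unit length in place, so the inner-product scores returned by the search become cosine similarities:
# normalise rows in place before adding/searching
faiss.normalize_L2(vectors)
faiss.normalize_L2(query)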
I wrote a function in which each cell of a DataFrame is divided by a number saved in another dataframe.
from tqdm import tqdm

def calculate_dfA(df_t, xout):
    df_A = df_t.copy()
    vector_x = xout.T
    for index_col, column in tqdm(df_A.iteritems()):
        for index_row, row in df_A.iterrows():
            df_A.iloc[index_row, index_col] = df_A.iloc[index_row, index_col] / vector_x.iloc[0, index_col]
    return df_A
The DataFrame on which I apply the calculation has a size of 14839 rows x 14839 columns. According to tqdm the processing speed is roughly 4.5 s/it. At that rate the calculation would require approximately 50 days, which is not feasible for me. Is there a way to speed up my calculation?
You need to vectorize your division:
result = df_A.values/vector_x
This will broadcast along the row dimension and divide along the column dimension, as you seem to ask for.
Compared to your double for-loop, you are taking advantage of contiguity and homogeneity of the data in memory. This allows for a massive speedup.
Edit: Coming back to this answer today, I noticed that converting to a numpy array first speeds up the computation. Locally I get a 10x speedup for an array of the size given in the question above, and I have edited my answer accordingly.
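For completeness, a minimal sketch of what a full drop-in replacement for calculate_dfA could look like, assuming (as in the question) that xout holds one divisor per column of df_t; the function name is hypothetical:
import numpy as np
import pandas as pd

def calculate_dfA_vectorized(df_t, xout):
    # one broadcasted division instead of two nested Python loops
    values = df_t.to_numpy() / np.asarray(xout).ravel()
    return pd.DataFrame(values, index=df_t.index, columns=df_t.columns)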
I'm on mobile right now, but you should try to avoid every for loop in Python - there's always a better way.
For one, I know you can multiply a pandas column (a Series) by another column to get your desired result.
I think that to multiply every column by the matching column of another DataFrame you would still need to iterate (but with only one for loop => performance boost).
I would strongly recommend temporarily converting to a numpy ndarray and working with that; a rough sketch follows below.
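A sketch of that single-loop idea, under the assumption that df_A and vector_x are the objects from the question and that their columns line up positionally:
import numpy as np

arr = df_A.to_numpy(dtype=float)          # work on a plain ndarray
divisors = np.asarray(vector_x).ravel()   # one divisor per column
for j in range(arr.shape[1]):             # a single Python loop over columns
    arr[:, j] /= divisors[j]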
I have a dataframe with 2 columns of zip codes, and I would like to add another column with the distance between them. I am able to do this for a fairly low number of rows, but I am now working with a dataframe that has about 500,000 rows. The code I have works, but on my current dataframe it has been running for about 30 minutes with no sign of finishing, so I feel what I'm doing is extremely inefficient.
Here is the code:
import pgeocode

dist = pgeocode.GeoDistance('us')

def distance_pairing(start, end):
    return dist.query_postal_code(start, end)

zips['distance'] = zips.apply(lambda x: distance_pairing(x['zipstart'], x['zipend']), axis=1)
zips
I know looping is out of the question, so is there something else I can do, efficiency-wise, that would make this better?
Whenever possible, use vectorized operations in pandas and numpy. In this case:
zips['distance'] = dist.query_postal_code(
zips['zipstart'].values,
zips['zipend'].values,
)
This won't always work, but in this case, the underlying pgeocode.haversine function is written (in numpy) to accommodate arrays of x and y coordinates. This should speed up your code by several orders of magnitude for a dataframe of this size.
As shown in this question, Calculating rolling correlation of pandas dataframes, I need to get the correlation of an array of length N against each window of a second array of length M.
import numpy as np

x = np.random.randint(0, 100, 10000)
y = [4, 5, 4, 5]
corrs = []
for i in range(0, (len(x) - len(y)) + 1):
    corrs.append(np.corrcoef(x[i:i + 4], y)[0, 1])
Every question I find that is similar to this discusses how to do it for an NxK matrix against an MxK matrix. However, the approaches I have tried do not work for 1d data. In the linked question, the suggestion is to roll over the pandas frame, which is pretty slow. Is there a faster way to calculate this?
The above code takes around 0.4 s, while the code from the linked example (with x as a pandas Series) takes 1.6 s:
corr = x.rolling(4).apply(lambda x: np.corrcoef(x, y)[0, 1], raw=False).dropna(how='all', axis=0)
Is there a much more efficient way to do this?
Store your correlation coefficients in a preallocated numpy array instead of a regular Python list (growing the list one element at a time adds overhead):
corrs = np.zeros(len(x) - len(y) + 1)
for i in range(0, (len(x) - len(y)) + 1):
    corrs[i] = np.corrcoef(x[i:i + 4], y)[0, 1]
I'm using sklearn's pairwise distance function, which saved my life when computing a huge matrix, but the problem I'm having is that I lose my indices.
Specifically, I initially have a huge dataframe of 17000 x 300, which I break down into 4 different dataframes based on some class condition.
The 4 separate dataframes keep the original indices, but after I run the pairwise distance function on one of them, I get back a 2d array with the correct values whose indices have been reset to start from 0.
How do I keep or recover the original indices?
from sklearn.metrics import pairwise as pair  # assuming this is how `pair` was imported

distance1 = pair.pairwise_distances(df1, metric='euclidean')
You can create a DataFrame with matching indices by passing the index parameter to the DataFrame constructor:
pd.DataFrame(distance1, index=df1.index)
Furthermore, if you would like to concatenate it horizontally to your existing DataFrame, you can use
pd.concat((df1, pd.DataFrame(distance1, index=df1.index)), axis=1)
There is an SFrame with columns whose elements are dicts.
import graphlab
import numpy as np
a = graphlab.SFrame({'col1':[{'oshan':3,'modi':4},{'ravi':1,'kishan':5}],
'col2':[{'oshan':1,'rawat':2},{'hari':3,'kishan':4}]})
I want to calculate cosine distance between these two columns for each row of the SFrame. Below is the operation using for loop.
dis = np.zeros(len(a), dtype=float)
for i in range(len(a)):
    dis[i] = graphlab.distances.cosine(a['col1'][i], a['col2'][i])
a['distance12'] = dis
This is very inefficient and would take hours if the number of rows were large. Could someone please suggest a better approach?
You can usually avoid looping over an SFrame by using the apply function. In your case, it would look like this:
a.apply(lambda row: graphlab.distances.cosine(row['col1'], row['col2']))
That should be significantly faster than looping in Python.
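For example, the result can be assigned straight back to the column the loop was filling (distance12 from the question):
a['distance12'] = a.apply(lambda row: graphlab.distances.cosine(row['col1'], row['col2']))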