Create large nxn co-occurrence matrix in pandas dataframe - python

I am creating a 12,000 x 12,000 co-occurrence matrix for which I have around 5 million values to be inserted. My first pandas dataframe consists of the 5 million word pairs and the number of times the pair occurred:
Dataframe count_data:
word_A | word_B | count
Furthermore, I have a zero-filled 12,000 x 12,000 co-occurrence matrix as a pandas dataframe, called co_matrix. The row and column names correspond to all the words from the word pairs. Now, I am looking for a fast way to insert all the co-occurrence counts from the first dataframe into the right two (!) positions in the co-occurrence dataframe. My code, which takes way too much time, is as follows:
for i in range(len(count_data)):
    co_matrix.loc[count_data['word_A'][i], count_data['word_B'][i]] = co_matrix.loc[count_data['word_B'][i], count_data['word_A'][i]] = count_data['count'][i]
The double assignment ensures the symmetry of the co-occurrence matrix. How can I insert the 5 million values in the matrix in a faster way?
Thanks in advance!
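For reference, one way this kind of fill is often vectorized (a rough sketch, not tested against the real data; it assumes co_matrix has identical row and column labels that cover every word appearing in count_data) is to translate the words into integer positions once and then write all the counts with NumPy fancy indexing:

import numpy as np
import pandas as pd

# Map each word label to its integer position in co_matrix
# (assumption: co_matrix rows and columns share the same labels)
pos = pd.Series(np.arange(len(co_matrix.index)), index=co_matrix.index)

rows = pos.loc[count_data['word_A']].to_numpy()
cols = pos.loc[count_data['word_B']].to_numpy()
vals = count_data['count'].to_numpy()

# Write every count into both symmetric positions at once
mat = co_matrix.to_numpy()
mat[rows, cols] = vals
mat[cols, rows] = vals
co_matrix = pd.DataFrame(mat, index=co_matrix.index, columns=co_matrix.columns)

As with the original loop, if a word pair appears more than once in count_data, the last count written wins.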

Related

How to iterate pandas dataframe without memory error

I have a CSV file with 140K rows and I am working with the pandas library.
The problem is that I have to compare each row with every other row.
This is taking too much time.
At the same time, I am creating another column to which I append data for each row based on the comparison, and there I get a memory error.
What is the optimal solution, at least for the memory error?
I am working with 12 GB of RAM on Google Colaboratory.
Dataframe sample:
ID x_coordinate y_coordinate
1 2 3
2 3 4
............
X 1 5
Now, I need to find the distance from each row to every other row, and if the distance is within a certain threshold, I assign a new ID to both rows involved. So if, in my case, ID 1 and ID 2 are within a certain distance, I assign a to both. And if ID 2 and ID X are within a certain distance, I assign b as the new matched ID, like below:
ID x_coordinate y_coordinate Matched ID
1 2 3 [a]
2 3 4 [a, b]
............
X 1 5 [b]
For distance I am using √((xᵢ − xⱼ)² + (yᵢ − yⱼ)²)
The threshold can be anything, say m units.
This reads as though you are attempting to hold the complete square distance matrix in memory, which, as you have noticed, does not scale well.
I'd suggest you read up on how DBSCAN clustering approaches the problem compared to, e.g., hierarchical clustering:
https://en.wikipedia.org/wiki/DBSCAN#Complexity
Instead of computing all the pairwise distances at once, they seem to put the data into a spatial database (for efficient neighborhood queries with a threshold) and then iterate over the points to identify the neighbors and the relevant distances on the fly.
Unfortunately I can't point you to readily available code or pandas functionality to support this though.
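That said, a rough illustration of the neighborhood-query idea is possible with scipy.spatial.cKDTree (this is an addition, not part of the original answer; the column names and the threshold m are taken from the question, and integer labels stand in for the a, b, ... matched IDs):

import pandas as pd
from scipy.spatial import cKDTree

# Toy data in the shape described above; m is the distance threshold
df = pd.DataFrame({'ID': [1, 2, 'X'],
                   'x_coordinate': [2, 3, 1],
                   'y_coordinate': [3, 4, 5]})
m = 2.0

tree = cKDTree(df[['x_coordinate', 'y_coordinate']].to_numpy())

# query_pairs returns every index pair (i, j) with distance <= m,
# without ever materializing the full 140K x 140K distance matrix
matched = [[] for _ in range(len(df))]
for label, (i, j) in enumerate(sorted(tree.query_pairs(r=m))):
    matched[i].append(label)  # each pair gets its own label, shared by both rows
    matched[j].append(label)

df['Matched ID'] = matched
print(df)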

LeftAnti join in pyspark is too slow

I am trying to do some operations in PySpark. I have a big dataframe (90 million rows, 23 columns) and another dataframe (30k rows, 1 column).
I have to remove from the first dataframe all the occurrences where the value of a certain column matches any of the values of the second dataframe.
firstdf = firstdf.join(seconddf, on = ["Field"], how = "leftanti")
The problem is that this operation is extremely slow (about 13 minutes on Databricks). Is there any way to improve its performance?
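One commonly suggested adjustment (a sketch, assuming the 30k-row dataframe comfortably fits in executor memory) is to broadcast the small dataframe so that Spark performs a broadcast anti join instead of shuffling the 90-million-row dataframe:

from pyspark.sql.functions import broadcast

# Hint Spark to ship the small dataframe to every executor
firstdf = firstdf.join(broadcast(seconddf), on=["Field"], how="leftanti")

Whether this actually helps depends on the cluster and on where the 13 minutes are really spent (e.g. reading the 90M rows), so it is only a starting point.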

What's the fastest way to check and drop rows between two asymmetrical dataframes?

I have two dataframes. Dataframe A (named data2_) has 2.5 million rows and 15 columns; Dataframe B (named data) has 250 rows and 4 columns. Both have a matching column: IDENTITY.
I want to reduce Dataframe A to just the rows whose IDENTITY value matches one in Dataframe B.
I tried this, but it takes far too long to compute (tqdm estimates a year):
for i in tqdm(list(range(data2_.shape[0]))):
    for t in list(range(data.shape[0])):
        if data2_["IDENTITY"].iloc[i] != data["IDENTITY"].iloc[t]:
            data2_.drop(index=i)
        else:
            pass
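For comparison, the usual vectorized way to express this filter (a sketch, not from the original post) is a single boolean mask built with isin, which avoids the nested Python loops entirely:

# Keep only the rows of data2_ whose IDENTITY also appears in data
data2_filtered = data2_[data2_['IDENTITY'].isin(data['IDENTITY'])]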

memory efficient solution for similarity calculations between items - purchases data

I'm working on product recommendations.
My dataset is as follows (a sample; the full one has more than 110,000 rows and more than 80,000 unique product_ids):
user_id product_id
0 0E3D17EA-BEEF-493 12909837
1 0FD6955D-484C-4FC8-8C3F 12732936
2 CC2877D0-A15C-4C0A Gklb38
3 b5ad805c-f295-4852 12909841
4 0E3D17EA-BEEF-493 12645715
I want to calculate the cosine similarity between products based on purchased products per user.
Why? I need to have as a final result:
the list of the 5 most similar products for each product_id.
So, I thought the first thing I need to do is convert the dataframe into this format:
where I have one row per user_id and the columns are product_ids. If a user bought product_id X, then the corresponding (row, column) cell contains the value 1, otherwise 0.
I did that using the pandas crosstab function.
crosstab_df = pd.crosstab(df.user_id, df.product_id).astype('bool').astype('int')
After that, I calculated the similarities between products.
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

def calculate_similarity(data_items):
    """Calculate the column-wise cosine similarity for a sparse
    matrix. Return a new dataframe matrix with similarities.
    """
    # create a scipy sparse matrix
    data_sparse = sparse.csr_matrix(data_items)
    # pairwise similarities between all samples in data_sparse.transpose()
    similarities = cosine_similarity(data_sparse.transpose())
    # put the similarities between products in a dataframe
    sim = pd.DataFrame(data=similarities, index=data_items.columns, columns=data_items.columns)
    return sim

similarity_matrix = calculate_similarity(crosstab_df)
I know that this is not efficient, because crosstab does not perform well when there are many rows and many columns, which is the case I have to handle. So, instead of using a crosstab dataframe, I thought I should use a scipy sparse matrix, as it makes the calculations (similarity calculations, vector normalisation) faster because the input is a numpy array rather than a dataframe.
However, I didn't know how to do it. I also need to keep track of which product_id each column corresponds to, so that I can then get the most similar product_ids for each product_id.
I found in answers to other questions that:
scipy.sparse.csr_matrix(df.values)
can be used, but in my case I think I can only use it after applying crosstab, whereas I want to get rid of the crosstab step.
People also suggested using scipy's coo_matrix, but I didn't understand how I can apply it in my case for the results I want.
I'm looking for a memory-efficient solution, as the initial dataset can grow to thousands of rows and hundreds of thousands of product_ids.
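For what it's worth, a minimal sketch of the coo_matrix idea under these assumptions (the dataframe is called df with columns user_id and product_id; the helper names are made up): factorize both columns into integer codes, build the sparse user-by-product matrix directly, and keep the product index around to map columns back to product_ids.

import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# Integer codes for users and products, plus the label lookups
user_codes, user_index = pd.factorize(df['user_id'])
prod_codes, prod_index = pd.factorize(df['product_id'])

# Binary user x product purchase matrix built directly as sparse (no crosstab)
ones = np.ones(len(df), dtype=np.int8)
purchases = sparse.coo_matrix((ones, (user_codes, prod_codes)),
                              shape=(len(user_index), len(prod_index))).tocsr()
purchases.data[:] = 1  # collapse repeat purchases to 0/1

# Product-product cosine similarities, kept sparse
similarities = cosine_similarity(purchases.T, dense_output=False)

# Example: the 5 most similar products to the product in column j
j = 0
row = similarities.getrow(j).toarray().ravel()
row[j] = 0  # exclude the product itself
top5 = prod_index[np.argsort(row)[::-1][:5]]

Here prod_index plays the role of the crosstab column labels, so column k of the similarity matrix corresponds to prod_index[k].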

Dataframe to dictionary when dealing with sparse vectors

I have a pandas dataframe with about 3000 columns.
The first column lists a category (the values can be repeated).
The second column through the last column contains 1s and 0s (it's essentially an indicator matrix). There are 20 or fewer 1s per row, so I am dealing with a sparse matrix.
I want to create a dictionary such that, when given a particular category, it gives you a matrix of the cosine distances of all the indicator vectors in the category (with the order from the data frame preserved). My data has about 100,000 rows as well, so I'm looking for an efficient way to do this.
Thanks
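A minimal sketch of one way to build such a dictionary (hypothetical column handling: the first column is taken as the category and the remaining columns as the 0/1 indicators):

import pandas as pd
from sklearn.metrics.pairwise import cosine_distances

category_col = df.columns[0]
indicator_cols = df.columns[1:]

# One cosine-distance matrix per category; groupby(sort=False) keeps the
# dataframe's original row order within each group
distances_by_category = {
    category: cosine_distances(group[indicator_cols].to_numpy())
    for category, group in df.groupby(category_col, sort=False)
}

If memory becomes an issue, the indicator block can be wrapped in scipy.sparse.csr_matrix before calling cosine_distances, which also accepts sparse input.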
