Memory-efficient solution for similarity calculations between items (purchase data) in Python
I'm working on product recommendations.
My dataset looks like this (a sample; the full one has more than 110,000 rows and more than 80,000 unique product_ids):
user_id product_id
0 0E3D17EA-BEEF-493 12909837
1 0FD6955D-484C-4FC8-8C3F 12732936
2 CC2877D0-A15C-4C0A Gklb38
3 b5ad805c-f295-4852 12909841
4 0E3D17EA-BEEF-493 12645715
I want to calculate the cosine similarity between products based on the products purchased by each user. Why? Because the final result I need is the list of the 5 most similar products for each product_id.
So I thought the first thing I need to do is convert the dataframe into a user-item format: one row per user_id and one column per product_id. If a user bought product_id X, the corresponding (row, column) cell contains 1, otherwise 0.
I did that using pandas' crosstab function:
crosstab_df = pd.crosstab(df.user_id, df.product_id).astype('bool').astype('int')
After that, I calculated the similarities between products.
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

def calculate_similarity(data_items):
    """Calculate the column-wise cosine similarity for a sparse
    matrix. Return a new dataframe matrix with similarities.
    """
    # convert the user x product dataframe into a scipy sparse matrix
    data_sparse = sparse.csr_matrix(data_items)
    # pairwise similarities between the columns (products) of data_sparse
    similarities = cosine_similarity(data_sparse.transpose())
    # put the product-to-product similarities in a dataframe labelled with the product_ids
    sim = pd.DataFrame(data=similarities, index=data_items.columns, columns=data_items.columns)
    return sim

similarity_matrix = calculate_similarity(crosstab_df)
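From similarity_matrix I then pull out, for each product, the 5 most similar other products. Something like the snippet below is what I have in mind for that step (the dict comprehension is just an illustration of the goal, not code I have validated at full scale):

# for each product, drop its self-similarity and keep the 5 highest-scoring products
top_5 = {
    product: similarity_matrix[product].drop(product).nlargest(5).index.tolist()
    for product in similarity_matrix.columns
}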
I know this is not efficient: crosstab doesn't perform well with many rows and many columns, which is exactly the case I have to handle. So instead of building a crosstab DataFrame, I thought I should build a scipy sparse matrix directly, since that makes the calculations (similarity computations, vector normalisation) faster by working on sparse arrays rather than a dense dataframe.
However, I don't know how to do that. I also need to keep track of which product_id each column corresponds to, so that I can then retrieve the most similar product_ids for each product_id.
I found in answers to other questions that
scipy.sparse.csr_matrix(df.values)
can be used, but in my case I think it only works after applying crosstab, while the crosstab step is precisely what I want to get rid of.
People also suggested using scipy's coo_matrix, but I didn't understand how to apply it in my case to get the results I want.
I'm looking for a memory-efficient solution, as the initial dataset can grow to hundreds of thousands of rows and product_ids.
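For reference, here is roughly what I imagine the coo_matrix version could look like (just a sketch, and I'm not sure it's right; using pd.factorize to keep track of which column corresponds to which product_id is my own guess):

import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# encode users and products as integer positions, keeping the mapping
# from column position back to product_id
user_codes, user_index = pd.factorize(df['user_id'])
product_codes, product_index = pd.factorize(df['product_id'])

# build the sparse user x product purchase matrix directly from the
# (user, product) pairs, without going through crosstab
purchases = sparse.coo_matrix(
    (np.ones(len(df)), (user_codes, product_codes)),
    shape=(len(user_index), len(product_index)),
).tocsr()

# duplicate purchases get summed when converting to CSR; clip them back to 1
purchases.data = np.minimum(purchases.data, 1)

# column-wise (product-to-product) cosine similarity, kept sparse because
# an 80000 x 80000 dense matrix would not fit in memory
similarities = cosine_similarity(purchases.T, dense_output=False)

# product_index[j] is the product_id corresponding to column j of `purchases`
# (and to row/column j of `similarities`)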
Related
Calculate the percentage difference between two specific rows in Python pandas
The problem is that I am trying to take a specific row I choose and calculate what percentage its value is away from the intended output's mean (which is already calculated from another column), i.e. how far it deviates from that mean. I want to handle each item individually. Below I made a dataframe column to store the result:
df['pct difference'] = ((df['tertiary_tag']['price'] - df['ab roller']['mean']) / df['ab roller']['mean']) * 100
For example, let's say the mean is 10 and I know that the item is 8 dollars: figure out whatever percentage away from the mean that product is, and return that number for each item of the dataset. Keep in mind, the problem is not solved by a loop, because I am sure pandas has something more practical to calculate the % difference (and not pct_change). I also thought about making a column to use as an index, so I can access any row within the columns by that index and do whatever operation I want, for example calculating the percentage difference between two rows. Maybe by indexing on the price column?
df = df.set_index(['price'])
df.index = pd.to_datetime(df.index)
def percent_diff(df, row1, row2):
    """Calculate the percentage difference between two specific rows in a dataframe."""
    return (df.loc[row1, 'value'] - df.loc[row2, 'value']) / df.loc[row2, 'value'] * 100
Pandas GroupBy for row number
I have enormous arrays of time-series data (~3GB, millions of rows) that I load into a dataframe through numpy memmap. I'd like to summarize them by getting descriptive statistics for each group of n elements (say 1000 per group). I really like the combination of groupby and describe, but it seems like groupby is only useful for categorical data. If it weren't a memmap I could add another column for time-interval categories. Is there a way to get a GroupBy object that I can use describe on, where the groups are defined by row index?
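A sketch of one way this is often handled (assuming a plain RangeIndex and a group size of 1000, both of which are just illustrative):

import numpy as np

group_size = 1000
# group rows by integer position: rows 0-999 -> group 0, 1000-1999 -> group 1, ...
# and compute descriptive statistics per chunk
stats = df.groupby(np.arange(len(df)) // group_size).describe()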
Correlation between rows in a pandas dataframe
I have this dataframe. I would like to find a way to make a correlation matrix between an hour and the same hour of the day before (for example H01 of 28/09 vs H01 of 27/09). I thought about two different approaches:
1) Take the correlation matrix of the transposed dataframe:
dft = df.transpose()
dft.corr()
2) Create a copy of the dataframe lagged by 1 day/row and then use .corrwith() to compare them.
With the first approach I obtain weird results (for example rows like 634 and 635 show low correlation even though their values are very similar); with the second approach I obtain all ones. Ideally I'm looking to find the correlation between days close to each other. Send help please.
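A sketch of the second (lagged) approach, assuming one row per day and one column per hour such as H01..H24 (the one-row lag is illustrative):

# correlate each day's row of hourly values with the previous day's row
lagged = df.shift(1)
daily_corr = df.corrwith(lagged, axis=1)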
Detecting bad information (python/pandas)
I am new to python and pandas and I was wondering whether pandas can filter out information within a dataframe that is otherwise inconsistent. For example, imagine that I have a dataframe with 2 columns: (1) product code and (2) unit of measurement. The same product code in column 1 may repeat several times, and there would be several different product codes. I would like to filter out the product codes for which there is more than 1 unit of measurement for the same product code. Ideally, when this happens, the filter would bring up all instances of such a product code, not just the instance in which the unit of measurement is different. To put more color on my request, the real objective here is to identify the product codes which have inconsistent units of measurement, as the same product code should always have the same unit of measurement in all instances. Thanks in advance!!
First you want some mapping of product code -> unit of measurement, i.e. the ground truth. You can either upload this, or try to be clever and derive it from the data, assuming that the most frequently used unit of measurement for a product code is the correct one. You could get this by doing:
truth_mapping = df.groupby(['product_code'])['unit_of_measurement'].agg(lambda x: x.value_counts().index[0]).to_dict()
Then you can get a column that is the 'correct' unit of measurement:
df['correct_unit'] = df['product_code'].apply(truth_mapping.get)
Then you can filter to rows that do not have the correct mapping:
df[df['correct_unit'] != df['unit_of_measurement']]
Try this. Sample df:
df12 = pd.DataFrame({'Product Code': ['A','A','A','A','B','B','C','C','D','E'],
                     'Unit of Measurement': ['x','x','y','z','w','w','q','r','a','c']})
Group by and see the count of all non-unique pairs:
new = df12.groupby(['Product Code','Unit of Measurement']).size().reset_index().rename(columns={0:'count'})
Drop all rows where the Product Code is repeated:
new.drop_duplicates(subset=['Product Code'], keep=False)
Dataframe to dictionary when dealing with sparse vectors
I have a pandas dataframe with about 3000 columns. The first column lists a category (the values can be repeated). The second column all the way to the last column lists 1s and 0s (it's somewhat of an indicator matrix). There are 20 or fewer 1s per row, so I am dealing with a sparse matrix. I want to create a dictionary such that, when given a particular category, it gives you a matrix of the cosine distances of all the indicator vectors in that category (with the order from the dataframe preserved). My data has about 100,000 rows as well, so I'm looking for an efficient way to do this. Thanks
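A sketch of one way to build such a dictionary, assuming the first column holds the category and the remaining columns are the 0/1 indicators (the groupby and cosine_distances shown here are illustrative, not a definitive implementation):

from sklearn.metrics.pairwise import cosine_distances

indicator_cols = df.columns[1:]
# {category: pairwise cosine-distance matrix of that category's indicator rows},
# preserving the original row order within each group
category_distances = {
    category: cosine_distances(group[indicator_cols].values)
    for category, group in df.groupby(df.columns[0], sort=False)
}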