I have a pandas dataframe with about 3000 columns.
The first column lists a category (the values can be repeated).
The second column through the last column contain 1s and 0s (it's essentially an indicator matrix). There are 20 or fewer 1s per row, so I am dealing with a sparse matrix.
I want to create a dictionary such that, given a particular category, it returns the matrix of cosine distances between all the indicator vectors in that category (with the row order from the DataFrame preserved). My data also has about 100,000 rows, so I'm looking for an efficient way to do this.
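A minimal sketch of one way to build such a dictionary, assuming df is the DataFrame described above, with the category in the first column and the ~3000 indicator columns after it:

from scipy import sparse
from sklearn.metrics.pairwise import cosine_distances

indicator_cols = df.columns[1:]          # everything after the category column

dist_by_category = {}
for cat, group in df.groupby(df.columns[0], sort=False):
    # rows keep their original DataFrame order within each group
    vectors = sparse.csr_matrix(group[indicator_cols].to_numpy())
    dist_by_category[cat] = cosine_distances(vectors)

Converting each group to a CSR matrix keeps memory use low, since each row has at most 20 non-zero entries.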
Thanks
I have a dataframe with several columns, including Department and ICA. I need to create a matrix where the rows are the departments and the columns are the ICA values (which range from Bad to Acceptable to Good).
Position (r, c) would then be the number of observations of that ICA value recorded for that department.
For example, if Amazonas is row 1 and Acceptable is column 3, position (1,3) would be the number of acceptable observations for Amazonas.
Thanks!
You can get values from your DataFrame using integer-based indexing with the DataFrame.iloc method. This seems to do what you need.
For example, if df is your DataFrame, then df.iloc[0, 2] will give you the value at the first row and third column.
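For building the matrix itself, a short sketch using pd.crosstab may help (the column names Department and ICA and the toy values below are assumptions based on the question):

import pandas as pd

# toy stand-in for the real data
df = pd.DataFrame({
    'Department': ['Amazonas', 'Amazonas', 'Antioquia', 'Amazonas'],
    'ICA': ['Acceptable', 'Good', 'Bad', 'Acceptable'],
})

# rows = departments, columns = ICA values, cells = number of observations
counts = pd.crosstab(df['Department'], df['ICA'])

# integer-based lookup with iloc, as described above: first row, first column
print(counts.iloc[0, 0])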
In the picture, I plot the values from an array of shape (400, 8).
I wish to reorganize the points to obtain 8 series of "continuous" points; let's call them a(t), b(t), ..., h(t), with a(t) being the series with the smallest values and h(t) the series with the largest values. They are unknown and I am trying to recover them.
I have some missing values replaced by 0.
When there is a 0, I do not know which series it belongs to. The zeros are always stored at the highest indices in the array.
For instance, at time t=136 only 4 values are valid: array[t,i] > 0 for i <= 3 and array[t,i] = 0 for i > 3.
How can I cluster the points so that I get "continuous" time series, i.e. at time t=136, array[136,0] should go into d, array[136,1] into e, array[136,2] into f, and array[136,3] into g?
I tried AgglomerativeClustering and DBSCAN with scikit-learn with no success.
Data are available at https://drive.google.com/file/d/1DKgx95FAqAIlabq77F9f-5vO-WPj7Puw/view?usp=sharing
My interpretation is that you mean that you have the data in 400 columns and 8 rows. The data values are assigned to the correct columns, but not necessarily to the correct rows. Your figure shows that the 8 signals do not cross each other, so you should be able to simply sort each column individually. But now the missing data is the problem, because the zeros representing missing data will all sort to the bottom rows, forcing the real data into the wrong rows.
I don't know if this is a good answer, but my first hunch is to start by sorting each column individually. Then begin at a place where several adjacent columns have full spans of real data, and work away from that location one column at a time, first to the left and then to the right: if the column contains no zeros, it is fine; if it contains zeros, compute local row averages over the immediately adjacent columns using only non-zero values (how many columns to include depends on the density of missing data and on the separation between the signals), then place each valid value in the current column into the row whose local row average is closest, and put zeros in the remaining rows. How to code that depends on what you have done so far. If you are using numpy, it would be convenient to first convert the zeros to NaNs, because numpy.nanmean() ignores NaNs. A rough sketch of this idea is given below.
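A simplified sketch, assuming arr holds the (400, 8) array from the question (it processes columns left to right rather than outward from a fully populated region, and the window size is a guess):

import numpy as np

data = arr.T.astype(float)        # rows = series, columns = time, as in the interpretation above
data[data == 0] = np.nan          # treat zeros as missing
data = np.sort(data, axis=0)      # sort each column; NaNs end up in the bottom rows

n_rows, n_cols = data.shape
window = 5                        # number of neighbouring columns to average; tune for your data

for c in range(n_cols):
    col = data[:, c].copy()
    missing = np.isnan(col)
    if not missing.any():
        continue
    lo, hi = max(0, c - window), min(n_cols, c + window + 1)
    local_avg = np.nanmean(data[:, lo:hi], axis=1)   # local row averages, ignoring NaNs
    new_col = np.full(n_rows, np.nan)
    for v in col[~missing]:
        r = np.nanargmin(np.abs(local_avg - v))      # closest local row average
        new_col[r] = v
        local_avg[r] = np.nan                        # each row can receive only one value
    data[:, c] = new_col

For real use you would probably want to start from a block of complete columns and work outward, as described above, rather than a simple left-to-right pass.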
I have a large DataFrame with 2 million observations. For my further analysis, I intend to use a relatively small sample (around 15-20% of the original DataFrame) drawn from the original DataFrame. While sampling, I also intend to keep the proportions of the categorical values in one of the columns intact.
For example: if one column has 5 categories as its values, red (20% of total observations), green (10%), blue (15%), white (25%), yellow (30%), I would like that column in the sample dataset to show the same proportions of the different categories.
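One common approach is grouped sampling with a fixed fraction, which keeps the category proportions roughly intact. A minimal sketch with toy data (the column name 'color' is a placeholder):

import pandas as pd

# toy stand-in for the 2-million-row DataFrame
df = pd.DataFrame({
    'color': ['red'] * 20 + ['green'] * 10 + ['blue'] * 15 + ['white'] * 25 + ['yellow'] * 30,
    'value': range(100),
})

# draw the same fraction from every category (GroupBy.sample needs pandas >= 1.1)
sample = df.groupby('color', group_keys=False).sample(frac=0.15, random_state=42)

print(sample['color'].value_counts(normalize=True))

sklearn.model_selection.train_test_split with its stratify argument is another option if scikit-learn is already in use.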
Please assist!
I'm working on product recommendations.
My dataset is as follows (a sample; the full one has more than 110,000 rows and more than 80,000 unique product_ids):
user_id product_id
0 0E3D17EA-BEEF-493 12909837
1 0FD6955D-484C-4FC8-8C3F 12732936
2 CC2877D0-A15C-4C0A Gklb38
3 b5ad805c-f295-4852 12909841
4 0E3D17EA-BEEF-493 12645715
I want to calculate the cosine similarity between products based on purchased products per user.
Why? Because I need, as a final result, the list of the 5 most similar products for each product_id.
So, I thought the first thing I need to do is to convert the dataframe into a user-item format, where I have one row per user_id and one column per product_id. If a user bought product_id X, then the corresponding (row, column) cell contains the value 1, otherwise 0.
I did that using the crosstab function of pandas:
crosstab_df = pd.crosstab(df.user_id, df.product_id).astype('bool').astype('int')
After that, I calculated the similarities between products.
import pandas as pd
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

def calculate_similarity(data_items):
    """Calculate the column-wise cosine similarity for a sparse
    matrix. Return a new dataframe matrix with similarities.
    """
    # create a scipy sparse matrix from the user x product dataframe
    data_sparse = sparse.csr_matrix(data_items)
    # pairwise similarities between all samples in data_sparse.transpose(),
    # i.e. between the product columns
    similarities = cosine_similarity(data_sparse.transpose())
    # put the similarities between products in a dataframe, keyed by product_id
    sim = pd.DataFrame(data=similarities, index=data_items.columns, columns=data_items.columns)
    return sim
similarity_matrix = calculate_similarity(crosstab_df)
I know that this is not efficient, because crosstab doesn't perform well when there are many rows and many columns, which is the case I have to handle. So, I thought that instead of using a crosstab DataFrame, I should use a scipy sparse matrix, as it makes the calculations (similarity computation, vector normalisation) faster because the input is a numpy array rather than a dataframe.
However, I didn't know how to do it. I also need to keep track of which product_id each column corresponds to, so that I can then get the most similar product_ids for each product_id.
I found in other questions answers that:
scipy.sparse.csr_matrix(df.values)
can be used, but in my case I think I can only use it after applying crosstab, while I want to get rid of the crosstab step altogether.
Also, people suggested using scipy's coo_matrix, but I didn't understand how I can apply it in my case to get the results I want.
I'm looking for a memory-efficient solution, as the initial dataset can keep growing to thousands of rows and hundreds of thousands of product_ids.
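One way to skip the crosstab step is to factorize the two id columns and build the sparse user-item matrix directly with coo_matrix; a hedged sketch, assuming df is the two-column DataFrame shown above:

import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

# map user_ids and product_ids to integer positions; the returned indexes
# keep track of which product_id each column corresponds to
user_codes, user_index = pd.factorize(df['user_id'])
item_codes, item_index = pd.factorize(df['product_id'])

# one entry per purchase; duplicate (user, product) pairs are summed on conversion to CSR
user_item = sparse.coo_matrix(
    (np.ones(len(df)), (user_codes, item_codes)),
    shape=(len(user_index), len(item_index)),
).tocsr()
user_item.data[:] = 1            # clip repeat purchases back to 0/1

# column-wise (product-product) cosine similarities, kept sparse
similarities = cosine_similarity(user_item.T, dense_output=False)

def top_5_similar(product_id):
    # illustrative helper: the 5 most similar product_ids to the given one
    j = item_index.get_loc(product_id)
    row = similarities.getrow(j).toarray().ravel()
    row[j] = 0                   # ignore self-similarity
    return item_index[np.argsort(row)[::-1][:5]]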
I am creating a 12,000 x 12,000 co-occurrence matrix for which I have around 5 million values to be inserted. My first pandas dataframe consists of the 5 million word pairs and the number of times the pair occurred:
Dataframe count_data:
word_A | word_B | count
Furthermore, I have a zero-filled 12,000 x 12,000 co-occurrence matrix as a pandas dataframe, called co_matrix. The row and column names correspond to all the words from the word pairs. Now, I am looking for a fast way to insert all the co-occurrence counts from the first dataframe into the right 2 (!) positions in the co-occurrence dataframe. My code, which takes far too much time, is as follows:
for i in range(len(count_data)):
    co_matrix.loc[count_data['word_A'][i], count_data['word_B'][i]] = \
        co_matrix.loc[count_data['word_B'][i], count_data['word_A'][i]] = count_data['count'][i]
The double assignment ensures the symmetry of the co-occurrence matrix. How can I insert the 5 million values in the matrix in a faster way?
Thanks in advance!
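A vectorised sketch of the same double assignment, using positional indexing instead of a Python loop (it assumes every word_A / word_B value is present among co_matrix's row and column labels):

import numpy as np

# integer positions of each word pair in the 12,000 x 12,000 matrix
row_pos = co_matrix.index.get_indexer(count_data['word_A'])
col_pos = co_matrix.columns.get_indexer(count_data['word_B'])
values = count_data['count'].to_numpy()

# write all 5 million counts with two vectorised assignments
mat = co_matrix.to_numpy()
mat[row_pos, col_pos] = values
mat[col_pos, row_pos] = values      # keep the matrix symmetric
co_matrix.iloc[:, :] = mat

If the matrix stays mostly zeros anyway, building a scipy.sparse.coo_matrix from the three columns may be even cheaper than holding a dense 12,000 x 12,000 DataFrame.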