I have a PySpark DataFrame, df1, that looks like:
Customer1 Customer2 v_cust1 v_cust2
1 2 0.9 0.1
1 3 0.3 0.4
1 4 0.2 0.9
2 1 0.8 0.8
I want to take the cosine similarity of the two columns and end up with something like this:
Customer1 Customer2 v_cust1 v_cust2 cosine_sim
1 2 0.9 0.1 0.1
1 3 0.3 0.4 0.9
1 4 0.2 0.9 0.15
2 1 0.8 0.8 1
I have a Python function that receives a number or an array of numbers, like this:
import numpy as np

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
How can I create the cosine_sim column in my dataframe using a UDF?
Can I pass several columns instead of one column to the udf cos_sim function?
It would be more efficient to use a pandas_udf instead: it performs much better on vectorized operations than regular Spark UDFs (see Introducing Pandas UDF for PySpark).
import numpy as np
from pyspark.sql.functions import PandasUDFType, pandas_udf
import pyspark.sql.functions as F

# Names of columns
a, b = "v_cust1", "v_cust2"
cosine_sim_col = "cosine_sim"

# Add a placeholder column to hold the values, since a grouped-map pandas_udf
# requires the input schema and the output schema to be the same.
df = df.withColumn(cosine_sim_col, F.lit(1.0).cast("double"))

@pandas_udf(df.schema, PandasUDFType.GROUPED_MAP)
def cos_sim(df):
    df[cosine_sim_col] = float(np.dot(df[a], df[b]) / (np.linalg.norm(df[a]) * np.linalg.norm(df[b])))
    return df

# Assuming that you want to group by Customer1 and Customer2 to form the arrays
df2 = df.groupby(["Customer1", "Customer2"]).apply(cos_sim)

# But if you want to send entire columns, add a column with the same
# value in all rows and group by it. For example:
df3 = df.withColumn("group", F.lit("group_a")).groupby("group").apply(cos_sim)
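On Spark 3.x the GROUPED_MAP pandas_udf style above is deprecated in favour of applyInPandas. A minimal sketch of the equivalent call, assuming the same df and a plain, undecorated cos_sim function:

# Spark 3.x variant (sketch): applyInPandas takes an ordinary Python function
# plus an explicit output schema, so cos_sim needs no decorator here.
df_g = df.withColumn("group", F.lit("group_a"))
df3 = df_g.groupby("group").applyInPandas(cos_sim, schema=df_g.schema)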
I have a dataframe df where the APerc columns range from APerc0 to APerc60:
ID FID APerc0 ... APerc60
0 X 0.2 ... 0.5
1 Z 0.1 ... 0.3
2 Y 0.4 ... 0.9
3 X 0.2 ... 0.3
4 Z 0.9 ... 0.1
5 Z 0.1 ... 0.2
6 Y 0.8 ... 0.3
7 W 0.5 ... 0.4
8 X 0.6 ... 0.3
I want to calculate the cosine similarity of the values across all APerc columns between each pair of rows. So the result for the above should be:
ID CosSim
1 0,2,4 0.997
2 1,8,7 0.514
1 3,5,6 0.925
I know how to generate cosine similarity for the whole df:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(df)
But I want to find the similarity between rows within each group and group the results together (or create a separate df per group). How can I do it fast for a big dataset?
One possible solution is to get the particular rows you want to use for the cosine similarity computation and do the following.
Here, combinations is the list of row-index pairs you want to compare.
import torch
import torch.nn as nn

cos = nn.CosineSimilarity(dim=0)
for idx1, idx2 in combinations:
    # Select the APerc columns (positions 2 onward) and convert them to tensors
    row1 = torch.tensor(df.iloc[idx1, 2:].to_numpy(dtype=float))
    row2 = torch.tensor(df.iloc[idx2, 2:].to_numpy(dtype=float))
    sim = cos(row1, row2)
    print(sim)
You can then use the result in whatever way you need.
Create a function for the calculation and then use df.apply(cosine_similarity_function); it is said that using apply can perform hundreds of times faster than going row by row.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html
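If the goal is one similarity matrix per FID group, a minimal sketch using sklearn's cosine_similarity per group (column names assumed from the question's dataframe):

from sklearn.metrics.pairwise import cosine_similarity

# One cosine-similarity matrix per FID group, computed over the APerc columns only.
aperc_cols = [c for c in df.columns if c.startswith('APerc')]
sims = {fid: cosine_similarity(grp[aperc_cols]) for fid, grp in df.groupby('FID')}
ids = {fid: grp['ID'].tolist() for fid, grp in df.groupby('FID')}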
I have a dataframe as shown below
import numpy as np
import pandas as pd

np.random.seed(100)
df = pd.DataFrame({'grade': np.random.choice(list('ABCD'), size=(20)),
                   'dash': np.random.choice(list('PQRS'), size=(20)),
                   'dumeel': np.random.choice(list('QWER'), size=(20)),
                   'dumma': np.random.choice((1234), size=(20)),
                   'target': np.random.choice([0, 1], size=(20))
                   })
I would like to do the below:
a) event rate - compute the % occurrence of 1s (from the target column) for each unique value in each of the input categorical columns
b) non-event rate - compute the % occurrence of 0s (from the target column) for each unique value in each of the input categorical columns
I tried the below:
input_category_columns = df.select_dtypes(include='object')
df_rate_calc = pd.DataFrame()
for ip in input_category_columns:
    feature, target = ip, 'target'
    df_rate_calc['col_name'] = (pd.crosstab(df[feature], df[target], normalize='columns'))
I would like to do this on a million rows, so an efficient approach would really be helpful.
I expect my output to be as shown below. I have shown it for only two columns, but I want to produce this output for all categorical columns.
Here is one approach:
Select the categorical columns (cols)
Melt the dataframe with target as id variable and cols as value variables
Group the dataframe and use value_counts to calculate frequency
Unstack to reshape the dataframe
cols = df.select_dtypes('object').columns
df_out = (
    df.melt('target', cols)
      .groupby(['variable', 'target'])['value']
      .value_counts(normalize=True)
      .unstack(1, fill_value=0)
)
print(df_out)
target            0    1
variable value
dash     P      0.4  0.3
         Q      0.2  0.3
         R      0.2  0.1
         S      0.2  0.3
dumeel   E      0.2  0.2
         Q      0.1  0.0
         R      0.4  0.6
         W      0.3  0.2
grade    A      0.4  0.2
         B      0.0  0.2
         C      0.4  0.3
         D      0.2  0.3
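If you prefer to stay closer to the crosstab attempt in the question, a minimal sketch (same df as above) that builds one normalized crosstab per categorical column and concatenates them into one table:

rates = pd.concat(
    {col: pd.crosstab(df[col], df['target'], normalize='columns')
     for col in df.select_dtypes('object').columns},
    names=['variable', 'value'],
)
print(rates)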
I have this resulting correlation matrix:
id  row col  corr  target_corr
0   a   b    0.95  0.2
1   a   c    0.70  0.2
2   a   d    0.20  0.2
3   b   a    0.95  0.7
4   b   c    0.35  0.7
5   b   d    0.65  0.7
6   c   a    0.70  0.6
7   c   b    0.35  0.6
8   c   d    0.02  0.6
9   d   a    0.20  0.3
10  d   b    0.65  0.3
11  d   c    0.02  0.3
After filtering highly correlated variable pairs based on the "corr" column, I am trying to add a new column that decides whether to mark the variable in "row" as "keep" or "drop": keep the variable that is least correlated with the target (the "target_corr" column) and drop the more correlated one. In other words, from the correlated variable pairs matching cut > 0.5, select the one least correlated to "target_corr":
Expected result:
id  row col  corr  target_corr  drop/keep
0   a   b    0.95  0.2          keep
1   a   c    0.70  0.2          keep
2   b   a    0.95  0.7          drop
3   b   d    0.65  0.7          drop
4   c   a    0.70  0.6          drop
5   d   b    0.65  0.3          keep
This has to work with very large dataframes, so the resulting correlation matrix can be larger than 100k x 100k; it is generated using PySpark:
import time
import numpy as np
import pandas as pd

def corrwith_matrix_no_save(df, data_cols=None, select_targets=None, method='pearson'):
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.stat import Correlation

    start_time = time.time()
    vector_col = "corr_features"
    if data_cols is None and select_targets is None:
        data_cols = df.columns
        select_targets = list(df.columns)
    assembler = VectorAssembler(inputCols=data_cols, outputCol=vector_col)
    df_vector = assembler.transform(df).select(vector_col)
    matrix = Correlation.corr(df_vector, vector_col, method)
    result = matrix.collect()[0]["pearson({})".format(vector_col)].values
    final_df = pd.DataFrame(result.reshape(-1, len(data_cols)), columns=data_cols, index=data_cols)
    final_df = final_df.apply(lambda x: x.abs() if np.issubdtype(x.dtype, np.number) else x)
    corr_df = final_df[select_targets]
    # corr_df.columns = [str(col) + '_corr' for col in corr_df.columns]
    corr_df['column_names'] = corr_df.index
    print('Execution time for correlation_matrix function:', time.time() - start_time)
    return corr_df
I created the pairwise dataframe from the upper triangle with numpy.triu and stack, and added the target column by merging the two resulting dataframes (I can provide more of that code if clarification is needed, but it would increase the content a lot).
def corrX_to_ls(corr_mtx):
    # Get the correlation matrix and its upper triangle
    df_target = corr_mtx['target']
    corr_df = corr_mtx.drop('target', axis=1)
    up = corr_df.where(np.triu(np.ones(corr_df.shape), k=1).astype(bool))
    print('This is triu: \n', up)
    df = up.stack().reset_index()
    df.columns = ['row', 'col', 'corr']
    df_lsDF = df.query("row != col")
    df_target_corr = df_target.reset_index()
    df_target_corr.columns = ['target_col', 'target_corr']
    sample_df = df_lsDF.merge(df_target_corr, how='left', left_on='row', right_on='target_col')
    sample_df = sample_df.drop('target_col', axis=1)
    return sample_df
Now, after filtering the resulting dataframe based on df.corr > cut (where cut > 0.50), I got stuck at marking which variable to keep and which to drop (I am looking to mark them only, then select the variables into a list). Help on solving this will be greatly appreciated and will also benefit the community when working on distributed systems.
Note: I am looking for an example/solution that scales, so I can distribute operations on executors; working on lists or on groups/subsets of the dataframe in parallel while avoiding loops is what I am after, so numpy.vectorize, threading and/or multiprocessing approaches are what I have in mind.
Additional "thinking" from top of my mind: I do think on grouping by
"row" column so can distribute processing each group on executors or
by using lists distribute processing in parallel on executors so each
list will generate a job for each thread from ThreadPool ( I done
done this approach for column vectors but for very large
matrix/dataframes can become inefficient so for rows I think will
work).
Given final_df as the sample input, you can try:
# filter
output = final_df.query('corr>target_corr').copy()
# assign drop/keep
output['drop_keep'] = np.where(output['corr'] > 2 * output['target_corr'],
                               'keep', 'drop')
Output:
id row col corr target_corr drop_keep
0 0 a b 0.95 0.2 keep
1 1 a c 0.70 0.2 keep
3 3 b a 0.95 0.7 drop
6 6 c a 0.70 0.6 drop
10 10 d b 0.65 0.3 keep
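A small follow-up sketch (names assumed from the answer above) for turning the marks into lists, since the goal is to select the variables into a list:

# Variables marked "drop" at least once, and the remaining ones to keep.
to_drop = output.loc[output['drop_keep'] == 'drop', 'row'].unique().tolist()
to_keep = [v for v in output['row'].unique() if v not in to_drop]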
I have a matrix like
id |v1_m1 v2_m1 v3_m1 f_m1 v1_m2 v2_m2 v3_m2 f_m2|
1 | 0 .5 .5 4 0.1 0.3 0.6 4 |
2 | 0.3 .3 .4 8 0.2 0.4 0.4 7 |
What I want is to multiply each of the v columns with the suffix "_m1" by the f_m1 column, and all the v columns with the suffix "_m2" by the f_m2 column.
The output that I expect is something like this:
id |v1_m1 v2_m1 v3_m1 v1_m2 v2_m2 v3_m2 |
1 | 0 2 2 0.4 1.2 2.4 |
2 | 2.4 2.4 3.2 1.4 2.8 2.8 |
for m in range(1, maxm):
    for i in range(1, maxv):
        df["v{}_m{}".format(i, m)] = df["v{}_m{}".format(i, m)] * df["f_m{}".format(m)]
for m in range(1, maxm):
    df = df.drop(columns=["f_m{}".format(m)])
You could do this with some fancy dataframe reshaping:
df.columns = pd.MultiIndex.from_arrays(zip(*df.columns.str.split('_')))
df=df.stack()
df_mul = df.filter(like='v').mul(df.filter(like='f').squeeze(), axis=0)
df_mul = df_mul.unstack().sort_index(level=1, axis=1)
df_mul.columns = [f'{i}_{j}' for i, j in df_mul.columns]
df_mul
Output:
v1_m1 v2_m1 v3_m1 v1_m2 v2_m2 v3_m2
id
1 0.0 2.0 2.0 0.4 1.2 2.4
2 2.4 2.4 3.2 1.4 2.8 2.8
Details:
Create MultiIndex column headers by splitting on '_'
Reshape the dataframe, stacking the m# level to rows and leaving four columns: f and the three v's
Using filter, select the v columns and multiply by the f series, which is created by selecting the single f column and using squeeze to turn the one-column dataframe into a pd.Series
Unstack the m# level back to columns
Flatten the MultiIndex column header back to a single level using an f-string in a list comprehension
Assuming that your matrix is a pandas dataframe called df, I would like to give my nomination for a list comprehension approach if you enjoy them.
import itertools
import pandas as pd

items = [(i[0][0], i[0][1].multiply(i[1][1]))
         for i in itertools.product(df.items(), repeat=2)
         if (i[0][0][-2:] == i[1][0][-2:])
         and i[1][0][:1] == 'f'
         and i[0][0][:1] != 'f']
df_mul = pd.DataFrame.from_dict({i[0]: i[1] for i in items})
It should be superfast on larger versions of this problem.
Explanation -
Creates a generator for cross-product between each column as (c1,c2) tuples
Keeps only the pairs where the last two characters are the same for both c1 and c2, AND c2 starts with 'f', AND c1 doesn't start with 'f' (leaving you with the columns you want to operate on as individual tuples). Something like this - [('v1_m1', 'f_m1'), ('v2_m1', 'f_m1'), ('v1_m2', 'f_m2')]
Multiplies the columns, attaches a column name and saves them as items (similar structure to df.items())
Turns the items into a dataframe
This is my df:
NAME DEPTH A1 A2 A3 AA4 AA5 AI4 AC5 Surface
0 Ron 2800.04 8440.53 1330.99 466.77 70.19 56.79 175.96 77.83 C
1 Ron 2801.04 6084.15 997.13 383.31 64.68 51.09 154.59 73.88 C
2 Ron 2802.04 4496.09 819.93 224.12 62.18 47.61 108.25 63.86 C
3 Ben 2803.04 5766.04 927.69 228.41 65.51 49.94 106.02 62.61 L
4 Ron 2804.04 6782.89 863.88 223.79 63.68 47.69 101.95 61.83 L
... ... ... ... ... ... ... ... ... ... ...
So, my first problem has been answered here:
Find percentile in pandas dataframe based on groups
Using:
df.groupby('Surface')['DEPTH'].quantile([.1, .9])
I can get the percentiles [.1,.9] from DEPTH grouped by Surface, which is what I need:
Surface
C 0.1 2800.24
0.9 2801.84
L 0.1 3799.74
0.9 3960.36
N 0.1 2818.24
0.9 2972.86
P 0.1 3834.94
0.9 4001.16
Q 0.1 3970.64
0.9 3978.62
R 0.1 3946.14
0.9 4115.96
S 0.1 3902.03
0.9 4073.26
T 0.1 3858.14
0.9 4029.96
U 0.1 3583.01
0.9 3843.76
V 0.1 3286.01
0.9 3551.06
Y 0.1 2917.00
0.9 3135.86
X 0.1 3100.01
0.9 3345.76
Z 0.1 4128.56
0.9 4132.56
Name: DEPTH, dtype: float64
Now, I believe that was already the hardest part. What is left is subsetting the original df to include only the values in between those DEPTH percentiles .1 & .9. So for example: DEPTH values in Surface group "Z" have to be greater than 4128.56 and less than 4132.56.
Note that I need df again, not df.groupby("Surface"): the final df would be exactly the same, but the rows whose depths are outside the borders should be dropped.
This seems so easy ... any ideas?
Thanks!
When you need to filter rows within groups it's often simpler and faster to use groupby + transform to broadcast the result to every row within a group and then filter the original DataFrame. In this case we can check if 'DEPTH' is between those two quantiles.
Sample Data
import pandas as pd
import numpy as np

np.random.seed(42)
df = pd.DataFrame({'DEPTH': np.random.normal(0, 1, 100),
                   'Surface': np.random.choice(list('abcde'), 100)})
Code
gp = df.groupby('Surface')['DEPTH']
df1 = df[df['DEPTH'].between(gp.transform('quantile', 0.1),
                             gp.transform('quantile', 0.9))]
For clarity, here you can see that transform will broadcast the scalar result to every row that belongs to the group, in this case defined by 'Surface'
pd.concat([df['Surface'], gp.transform('quantile', 0.1).rename('q = 0.1')], axis=1)
# Surface q = 0.1
#0 a -1.164557
#1 e -0.967809
#2 a -1.164557
#3 c -1.426986
#4 b -1.544816
#.. ... ...
#95 a -1.164557
#96 e -0.967809
#97 b -1.544816
#98 b -1.544816
#99 b -1.544816
#
#[100 rows x 2 columns]
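For comparison, a sketch of the same filter written with groupby + apply (usually slower than the transform approach, but sometimes easier to read):

df2 = (df.groupby('Surface', group_keys=False)
         .apply(lambda g: g[g['DEPTH'].between(g['DEPTH'].quantile(0.1),
                                               g['DEPTH'].quantile(0.9))]))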