Beginner pySpark question here:
How do I find the indices where all vectors are zero?
After a series of transformations, I have a spark df with ~2.5M rows and a tfidf Sparse Vector of length ~262K. I would like to perform PCA dimensionality reduction to make this data more manageable for multi-layer perceptron model fitting, but pyspark's PCA is limited to a max of 65,535 columns.
+--------------------+
|      tfidf_features|
+--------------------+
|(262144,[1,37,75,...|
|(262144,[0],[0.12...|
|(262144,[0],[0.12...|
|(262144,[0],[0.12...|
|(262144,[0,6,22,3...|
+--------------------+
df.count() >>> 2.5M
Example Vector:
SparseVector(262144, {7858: 1.7047, 12326: 1.2993, 15207: 0.0953, 24112: 0.452, 40184: 1.7047,...255115: 1.2993, 255507: 1.2993})
Therefore, I would like to delete the indices (columns) of the sparse tfidf vector that are zero for all ~2.5M documents (rows). This will hopefully get me under the 65,535 maximum for PCA.
My plan is to create a udf that (1) converts the Sparse Vectors to Dense Vectors (or np arrays), (2) searches all Vectors to find indices where all Vectors are zero, and (3) deletes those indices. However, I am struggling with the second part (finding the indices where all vectors equal zero). Here's where I am so far, but I think my plan of attack is far too time-consuming and not very pythonic (especially for such a big dataset):
import numpy as np
row_count = df.count()
def find_zero_indicies(df):
    vectors = df.select('tfidf_features').take(row_count)[0]
    zero_indices = []
    to_delete = []
    for vec in vectors:
        vec = vec.toArray()
        for value in vec:
            if value.nonzero():
                zero_indices.append(vec.index(value))
    for value in zero_indices:
        if zero_indices.count(value) == row_count:
            to_delete.append(value)
    return to_delete
Any advice or help appreciated!
If anything, it makes more sense to find indices which should be preserved:
from pyspark.ml.linalg import DenseVector, SparseVector
from pyspark.sql.functions import explode, udf
from operator import itemgetter

# Return the positions a vector actually uses, as an array<integer> column
@udf("array<integer>")
def indices(v):
    if isinstance(v, DenseVector):
        return [i for i in range(len(v))]
    if isinstance(v, SparseVector):
        return v.indices.tolist()
    return []

# Collect the distinct indices used by at least one row
indices_list = (df
    .select(explode(indices("tfidf_features")))
    .distinct()
    .rdd.map(itemgetter(0))
    .collect())
and use VectorSlicer:
from pyspark.ml.feature import VectorSlicer
slicer = VectorSlicer(
inputCol="tfidf_features",
outputCol="tfidf_features_subset", indices=indices_list)
slicer.transform(df)
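To see the idea without a Spark cluster, here is a minimal sketch in plain Python. It assumes each sparse vector is a dict of {index: value}, which is a hypothetical stand-in for pyspark's SparseVector, and the helper names used_indices and slice_vector are made up for illustration. It collects the union of indices that are used anywhere, then re-indexes every vector onto that smaller set, which is roughly what the udf plus VectorSlicer combination above does.

```python
def used_indices(vectors):
    """Union of all indices that are non-zero in at least one vector."""
    keep = set()
    for vec in vectors:
        keep.update(i for i, value in vec.items() if value != 0)
    return sorted(keep)


def slice_vector(vec, keep):
    """Re-index a sparse vector so that only the kept indices remain."""
    position = {old: new for new, old in enumerate(keep)}
    return {position[i]: value for i, value in vec.items() if i in position}


# Toy data: dicts standing in for sparse vectors (assumption, not pyspark types)
vectors = [
    {1: 0.5, 37: 1.2},
    {0: 0.12},
    {0: 0.12, 6: 0.7},
]

keep = used_indices(vectors)                      # indices worth preserving
sliced = [slice_vector(v, keep) for v in vectors]  # vectors of length len(keep)
```

The resulting vectors have length len(keep) instead of the original dimension, so the approach only helps if fewer than 65,535 distinct indices are actually in use.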
However, in practice I would recommend using a fixed-size vector, either with HashingTF:
HashingTF(inputCol="words", outputCol="tfidf_features", numFeatures=65535)
or CountVectorizer:
CountVectorizer(inputCol="words", outputCol="vectorizer_features",
vocabSize=65535)
In both cases you can combine it with StopWordsRemover.
I have a csv file with 10 columns. I can use pandas to import the dataframe and use the corr() function to output a matrix heatmap. What I want to achieve next is for the code to loop through the dataframe and find high or low correlations between combinations of columns
For example, the simple correlation matrix looks at:
A:A, A:B, A:C, A:D etc
But I want the code to combine columns, in every conceivable way, such as:
AB:A, AB:B, AB:C, AB: D etc
ABC:A, ABC:B, ABC:D etc
And if there are any noticeable correlations between certain combinations, to highlight those.
Is this possible at all? Or are there proprietary applications that can do this?
Thanks
I assume that by "combination" you mean linear combination. You can loop over the columns (not the most elegant way) and use sklearn's linear_model:
import pandas as pd
import numpy as np
from sklearn import linear_model
df = pd.DataFrame(np.random.random([10,10]),columns=['A','B','C','D','E','F','G','H','I','J'])
for i,col1 in enumerate(df):
    if i > 0:
        X = df.iloc[:,0:i]
        for j,col2 in enumerate(df):
            if j >= i:
                y = df[[col2]]
                regr = linear_model.LinearRegression()
                regr.fit(X, y)
                score = regr.score(X,y)
                print(f'X: {X.columns} y: {y.columns} score:{score}')
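The loop above only tries prefixes of the column list (A, AB, ABC, ...). If every subset of columns should be tried, one possible sketch uses itertools.combinations together with a plain least-squares fit in numpy. The helper names r_squared and combination_scores and the max_size cap are assumptions made for illustration. The score is the R² of predicting the target column from a linear combination of the chosen columns, which is the same quantity LinearRegression.score reports.

```python
from itertools import combinations

import numpy as np


def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X (with intercept)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    total = y - y.mean()
    return 1.0 - (residual @ residual) / (total @ total)


def combination_scores(data, names, max_size=2):
    """Score every column subset of up to max_size columns against each other column."""
    scores = {}
    for size in range(1, max_size + 1):
        for combo in combinations(range(len(names)), size):
            for target in range(len(names)):
                if target in combo:
                    continue  # do not predict a column from itself
                key = ("".join(names[i] for i in combo), names[target])
                scores[key] = r_squared(data[:, list(combo)], data[:, target])
    return scores


# Synthetic data in which D is exactly A + B, so the pair AB explains D fully
rng = np.random.default_rng(0)
data = rng.random((50, 4))
data[:, 3] = data[:, 0] + data[:, 1]

scores = combination_scores(data, ["A", "B", "C", "D"])
best = max(scores, key=scores.get)  # combination with the highest R^2
```

The number of subsets grows exponentially with the column count, so with 10 columns it is worth keeping max_size small or filtering the results by a score threshold.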
I am working on creating a function which will calculate the cosine similarity of each record in a dataset (MxK dimension) against records in another dataset (NxK dimension) where N is much smaller than M.
The code below does the job well when I test it on a tiny dataset (the 'iris' dataset, for example). I am worried it might struggle when I have bigger datasets (100K records and 100+ variables).
I know for loops are not advisable in such scenarios, and I have two of them here. Can anyone suggest ways of improving this code?
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
def similarity_calculation(seed_data, pool_data):
    # Create an empty dataframe to store the similarity scores
    similarity_matrix = pd.DataFrame()
    for indexi, rowi in pool_data.iterrows():
        # Create an array to store the similarity score for each record in pool data
        similarity_score_array = []
        for indexj, rowj in seed_data.iterrows():
            # Fetch a single record from pool dataset
            pool = rowi.values.reshape(1, -1)
            # Fetch a single record from seed dataset
            seed = rowj.values.reshape(1, -1)
            # Measure similarity score between the two records
            similarity_score = (cosine_similarity(pool, seed))[0][0]
            similarity_score_array.append(similarity_score)
        # Append the similarity score array as a new record to the similarity matrix
        similarity_matrix = similarity_matrix.append(pd.Series(similarity_score_array), ignore_index=True)
Edit1: Sample data iris dataset is used as follows
iris_data = pd.read_csv("iris_data.csv", header=0)
# Split the data into seeds and pool sets, excluding the species details
seed_set = iris_data.iloc[:10, :4]
pool_set = iris_data.iloc[10:, :4]
Expected result is
My new compact code (with a single for loop) is as follows
def similarity_calculation_compact(seed_data, pool_data):
    Array1 = pool_data.values
    Array2 = seed_data.values
    scores = []
    for i in range(Array1.shape[0]):
        scores.append(np.mean(cosine_similarity(Array1[None, i, :], Array2)))
    final_data = pool_data.copy()
    final_data['mean_similarity_score'] = scores
    final_data = final_data.sort_values(by='mean_similarity_score', ascending=False)
    return final_data
The output I am getting is
I was expecting identical results as both functions are supposed to fetch records from pool data most similar (in terms of average cosine similarity) to the seed data.
There is no need for the for-loops, since cosine_similarity takes as input two arrays of shapes (n_samples_X, n_features) and (n_samples_Y, n_features) and returns an array of shape (n_samples_X, n_samples_Y) by computing cosine similarity between each pair of the two input arrays.
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
iris_data = pd.read_csv("iris.csv", header=0)
seed_set = iris_data.iloc[:10, :4]
pool_set = iris_data.iloc[10:, :4]
np.mean(cosine_similarity(pool_set, seed_set), axis=1)
Result (after sorting):
array([0.99952255, 0.99947777, 0.99947545, 0.99946886, 0.99946596, ...])
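For intuition about why no loop is needed, here is a small numpy-only sketch of the same computation. It is not sklearn's implementation, and the function name mean_cosine_similarity and the random test data are assumptions. It relies on the fact that the cosine similarity of two vectors is the dot product of their unit-length versions, so all pairs can be obtained with one matrix product. The sketch assumes no row is all zeros, which would cause a division by zero.

```python
import numpy as np


def mean_cosine_similarity(pool, seed):
    """Mean cosine similarity of every pool row against all seed rows."""
    pool = np.asarray(pool, dtype=float)
    seed = np.asarray(seed, dtype=float)
    # Scale every row to unit length; cosine similarity is then a dot product
    pool_unit = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    seed_unit = seed / np.linalg.norm(seed, axis=1, keepdims=True)
    similarity = pool_unit @ seed_unit.T  # shape (M, N): one entry per pair
    return similarity.mean(axis=1)        # shape (M,): average over the seeds


# Random stand-in data (M=6 pool rows, N=3 seed rows, K=4 features)
rng = np.random.default_rng(0)
pool = rng.random((6, 4)) + 0.1
seed = rng.random((3, 4)) + 0.1

scores = mean_cosine_similarity(pool, seed)
ranking = np.argsort(scores)[::-1]  # pool row positions, most similar first
```

The intermediate matrix has M x N entries, so for a very large pool it may be worth processing the pool in chunks of rows.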
I'm trying to normalize data with missing (i.e. nan) values before processing it, using scikit-learn preprocessing.
Apparently, some scalers (e.g. StandardScaler) handle the missing values the way I want, by which I mean they normalize the existing values while keeping the nans, while others (e.g. Normalizer) just raise an error.
I've looked around and haven't found - how can I use the Normalizer with missing values, or replicate its behavior (with norm='l1' and norm='l2'; I need to test several normalization options) some other way?
from sklearn.preprocessing import Normalizer, StandardScaler
import numpy as np
data = np.array([0,1,2,np.nan, 3,4])
scaler = StandardScaler(with_mean=True, with_std=True)
scaler.fit_transform(data.reshape(-1,1))
normalizer = Normalizer(norm='l2')
normalizer.fit_transform(data.reshape(-1,1))
The problem with your request is that Normalizer operates in this fashion, according to the documentation:
Normalize samples individually to unit norm.
Each sample (i.e. each row of the data matrix) with at least one non
zero component is rescaled independently of other samples so that its
norm (l1 or l2) equals one (source here)
That means that each row is rescaled so that its norm equals one. How should a missing value be handled? Ideally, it seems you don't want it to count towards the norm and you want the row to be normalized regardless of it, but the internal function check_array prevents this by raising an error.
You need to circumvent such a situation. The most reasonable way to do it is to:
first create a mask in order to record which elements were missing in your array
create a response array filled with missing values
apply the Normalizer to your array after selecting only the valid entries
record on your response array the normalized values based on their original position
Here is some code detailing the process, based on your example:
from sklearn.preprocessing import Normalizer, StandardScaler
import numpy as np
data = np.array([0,1,2,np.nan, 3,4])
# set valid mask
nan_mask = np.isnan(data)
valid_mask = ~nan_mask
normalizer = Normalizer(norm='l2')
# create a result array
result = np.full(data.shape, np.nan)
# write the normalized values back to the valid positions only
result[valid_mask] = normalizer.fit_transform(data[valid_mask].reshape(-1,1)).reshape(data[valid_mask].shape)
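Note that the example reshapes the data to a single column, so every value is its own sample. For a 2D array where each row is a sample and NaNs are scattered inside rows, the masking idea can be written directly in numpy. The following is a sketch, not a scikit-learn API. The function name nan_normalize is hypothetical, and it assumes that NaN entries should simply be left out of the norm and stay NaN in the output.

```python
import numpy as np


def nan_normalize(X, norm="l2"):
    """Rescale each row to unit norm, ignoring NaN entries and keeping them as NaN."""
    X = np.asarray(X, dtype=float)
    if norm == "l1":
        norms = np.nansum(np.abs(X), axis=1, keepdims=True)
    elif norm == "l2":
        norms = np.sqrt(np.nansum(X ** 2, axis=1, keepdims=True))
    else:
        raise ValueError("norm must be 'l1' or 'l2'")
    norms[norms == 0] = 1.0  # leave all-zero (or all-NaN) rows unchanged
    return X / norms


X = np.array([[3.0, np.nan, 4.0],
              [1.0, 1.0, np.nan]])

l2 = nan_normalize(X, "l2")  # first row becomes [0.6, nan, 0.8]
l1 = nan_normalize(X, "l1")  # second row becomes [0.5, 0.5, nan]
```

Because np.nansum treats NaN as zero, the norm of each row is computed from the valid entries only, and dividing by it leaves the NaN positions untouched.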
I have a Spark dataframe with which I am calculating the Euclidean distance between each row and a given set of coordinates. I am recreating a structurally similar dataframe 'df_vector' here to explain better.
from pyspark.ml.feature import VectorAssembler
arr = [[1,2,3], [4,5,6]]
df_example = spark.createDataFrame(arr, ['A','B','C'])
assembler = VectorAssembler(inputCols=[x for x in df_example.columns],outputCol='features')
df_vector = assembler.transform(df_example).select('features')
>>> df_vector.show()
+-------------+
| features|
+-------------+
|[1.0,2.0,3.0]|
|[4.0,5.0,6.0]|
+-------------+
>>> df_vector.dtypes
[('features', 'vector')]
As you can see the features column is a vector. In practice, I get this vector column as the output of a StandardScaler. Anyway, since I need to calculate Euclidean distance, I do the following
rdd = df_vector.select('features').rdd.map(lambda r: np.linalg.norm(r-b))
where
b = np.asarray([0.5,1.0,1.5])
I have all the calculations I need but I need this rdd as a column in df_vector. How do I go about it?
Instead of creating a new rdd, you could use a UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
norm_udf = udf(lambda r: np.linalg.norm(r - b).tolist(), FloatType())
df_vector.withColumn("norm", norm_udf(df_vector.features))
Make sure numpy is installed on the worker nodes.
One way to tackle performance issues might be to use mapPartitions. The idea would be, at a partition level, to convert features to an array and then calculate the norm on the whole array (thus implicitly using numpy vectorisation). Then do some housekeeping to get the form you want. For large datasets this might improve performance:
Here is the function which calculates the norm at partition level:
from pyspark.sql import Row
def getnorm(vectors):
    # convert vectors into numpy array
    vec_array=np.vstack([v['features'] for v in vectors])
    # calculate the norm
    norm=np.linalg.norm(vec_array-b, axis=1)
    # tidy up to get norm as a column
    output=[Row(features=x, norm=y) for x,y in zip(vec_array.tolist(), norm.tolist())]
    return output
Applying this using mapPartitions gives an RDD of Rows which can then be converted to a DataFrame:
df_vector.rdd.mapPartitions(getnorm).toDF()
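The per-partition idea can be tried without Spark. In the sketch below, plain Python lists stand in for partitions (an assumption made for illustration), and the hypothetical helper partition_norms computes all distances of one partition in a single numpy call. It also guards against empty partitions, because np.vstack raises a ValueError when given an empty list.

```python
import numpy as np

b = np.asarray([0.5, 1.0, 1.5])


def partition_norms(vectors):
    """Distances from every vector in one partition to b, in one numpy call."""
    rows = [np.asarray(v, dtype=float) for v in vectors]
    if not rows:  # an empty partition yields nothing
        return []
    vec_array = np.vstack(rows)
    return np.linalg.norm(vec_array - b, axis=1).tolist()


# Lists standing in for three partitions, the last one empty
partitions = [
    [[1.0, 2.0, 3.0]],
    [[4.0, 5.0, 6.0], [0.5, 1.0, 1.5]],
    [],
]

norms = [n for part in partitions for n in partition_norms(part)]
```

The benefit over a row-wise UDF comes from calling numpy once per partition rather than once per row, so it depends on partitions being reasonably large.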
I have a bunch of 14784 text documents, which I am trying to vectorize, so I can run some analysis. I used the CountVectorizer in sklearn, to convert the documents to feature vectors. I did this by calling:
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(examples)
where examples is an array of all the text documents
Now, I am trying to use additional features. For this, I am storing the features in a pandas dataframe. At present, my pandas dataframe (without the text features inserted) has the shape (14784, 5). The shape of my feature matrix is (14784, 21343).
What would be a good way to insert the vectorized features into the pandas dataframe?
Return term-document matrix after learning the vocab dictionary from the raw documents.
X = vect.fit_transform(docs)
Convert sparse csr matrix to dense format and allow columns to contain the array mapping from feature integer indices to feature names.
count_vect_df = pd.DataFrame(X.todense(), columns=vect.get_feature_names_out())
Concatenate the original df and the count_vect_df columnwise.
pd.concat([df, count_vect_df], axis=1)
If your base data frame is df, all you need to do is:
import pandas as pd
features_df = pd.DataFrame(features.toarray())
combined_df = pd.concat([df, features_df], axis=1)
I'd recommend some options to reduce the number of features, which could be useful depending on what type of analysis you're doing. For example, if you haven't already, I'd suggest looking into removing stop words and stemming. Additionally, you can set max_features when constructing the vectorizer, like vectorizer = CountVectorizer(max_features=1000), to limit the number of features.
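If a dense conversion of a 14784 x 21343 matrix is too memory-hungry, one option is to keep the columns sparse with pandas' DataFrame.sparse.from_spmatrix. The sketch below uses a tiny hand-built scipy matrix in place of the CountVectorizer output, and the column names and the length column are made up for illustration.

```python
import pandas as pd
from scipy import sparse

# Stand-ins for the existing dataframe and the vectorizer output (assumptions)
df = pd.DataFrame({"length": [3, 5, 2]})
features = sparse.csr_matrix([[1, 0, 2],
                              [0, 0, 1],
                              [0, 3, 0]])
feature_names = ["apple", "pear", "plum"]

# Build a dataframe whose columns are stored as sparse arrays
features_df = pd.DataFrame.sparse.from_spmatrix(
    features, index=df.index, columns=feature_names
)

# Column-wise concatenation; the shared index keeps the rows aligned
combined_df = pd.concat([df, features_df], axis=1)
```

Passing index=df.index matters when the original dataframe does not have a default 0..n-1 index, since pd.concat aligns rows by index label.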