Let me disclose the full background of my problem first; a simplified MWE that recreates the same issue is at the bottom. Feel free to skip my rambling about the setup and go straight to the last section.
The Actors in my Original Problem:
A Spark DataFrame data read from Amazon S3, with a column scaled_features that is ultimately the result of a VectorAssembler operation followed by a MinMaxScaler.
A Spark DataFrame column pca_features that results from the above column after a PCA, like so:
# Build a distributed row matrix from the scaled feature vectors
mat = RowMatrix(data.select('scaled_features').rdd.map(list))
# Compute the top 2 principal components
pc = mat.computePrincipalComponents(2)
# Project the rows onto the components and wrap the result in a DataFrame
projected = mat.multiply(pc).rows.map(lambda x: (x, )).toDF().withColumnRenamed('_1', 'pca_features')
Two instances of BisectingKMeans fitting to the features in the two aforementioned data frames, like so:
kmeans_scaled = BisectingKMeans(featuresCol='scaled_features').setK(4).setSeed(1)
model1 = kmeans_scaled.fit(data)
kmeans_pca = BisectingKMeans(featuresCol='pca_features').setK(4).setSeed(1)
model2 = kmeans_pca.fit(projected)
The Issue:
While BisectingKMeans fits to scaled_features from my first df without issues, attempting a fit to the projected features errors out with the following:
Py4JJavaError: An error occurred while calling o1413.fit.
: java.lang.IllegalArgumentException: requirement failed: Column features must be of type equal to one of the following types:
[struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, array<double>, array<float>]
but was actually of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
As you can see, Py4J complains that I'm passing data of a struct type that is, character for character, identical to the first type in the list of allowed types.
Additional Debug Info:
My Spark is running version 2.4.0
Checking the dtypes yields data.dtypes: [('scaled_features', 'vector')] and projected.dtypes: [('pca_features', 'vector')]. The schema is the same for both dataframes as well; printing just one for reference:
root
|-- scaled_features: vector (nullable = true)
Recreating the error (MWE):
It turns out that the same error can be reproduced by building a simple data frame from some Vectors (the columns in my original dfs are of vector type as well):
from pyspark.sql import Row
from pyspark.mllib.linalg import DenseVector
from pyspark.ml.clustering import BisectingKMeans
test_data = spark.createDataFrame([Row(test_features=DenseVector([43.0, 0.0, 200.0, 1.0, 1.0, 1.0, 0.0, 3.0])),
Row(test_features=DenseVector([44.0, 0.0, 250.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
Row(test_features=DenseVector([23.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
Row(test_features=DenseVector([25.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 2.0])),
Row(test_features=DenseVector([19.0, 0.0, 200.0, 1.0, 0.0, 1.0, 0.0, 1.0]))])
kmeans_test = BisectingKMeans(featuresCol='test_features').setK(4).setSeed(1)
model3 = kmeans_test.fit(test_data)
The last line results in the same error I'm facing in my original setup.
Can anyone explain this error and suggest a way to rectify it?
After a few more days of investigation, I was pointed to the (rather embarrassing) cause of the issue:
PySpark has two machine learning libraries, pyspark.ml and pyspark.mllib, and it turns out their vector types don't mix. Replacing from pyspark.mllib.linalg import DenseVector with from pyspark.ml.linalg import DenseVector resolves all the issues. The error message looks self-contradictory because both libraries' vectors serialize to the same-looking struct; the type check compares the underlying UDT classes, which differ, rather than the printed schema.
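For completeness, here is the MWE from above with only the import changed; everything else stays the same and the fit now succeeds:
# from pyspark.mllib.linalg import DenseVector   # old import, causes the error
from pyspark.ml.linalg import DenseVector        # the ml-package vector works

kmeans_test = BisectingKMeans(featuresCol='test_features').setK(4).setSeed(1)
model3 = kmeans_test.fit(test_data)  # no longer raises the IllegalArgumentException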
Related
Does computeSVD() use map and reduce?
Since it is a predefined function, I couldn't find its source code.
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([
Vectors.sparse(5, {1: 1.0, 3: 7.0}),
Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
])
mat = RowMatrix(rows)
# Compute the top 5 singular values and corresponding singular vectors.
svd = mat.computeSVD(5, computeU=True)  # <-- this function
U = svd.U # The U factor is a RowMatrix.
s = svd.s # The singular values are stored in a local dense vector.
V = svd.V # The V factor is a local dense matrix.
It does. From the Spark documentation:
This page documents sections of the MLlib guide for the RDD-based API (the spark.mllib package). Please see the MLlib Main Guide for the DataFrame-based API (the spark.ml package), which is now the primary API for MLlib.
If you want to look at the code base, here it is: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L328
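As an illustration only (a sketch, not Spark's actual implementation), the same map/reduce pattern can be written by hand on the rows RDD from the question, e.g. to accumulate the Gramian A^T A that the SVD is derived from:
import numpy as np
# Sum of outer products of the rows, i.e. the Gramian A^T A,
# expressed as an explicit map followed by a reduce
gram = rows.map(lambda v: np.outer(v.toArray(), v.toArray())) \
           .reduce(lambda a, b: a + b)
Spark's real computeSVD uses optimized aggregation (treeAggregate) rather than a plain reduce, but the underlying idea is the same.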
I am trying to make more sense of these two vector types, so I am creating two vectors to see if I am doing it right. My goal is to create two identical arrays:
dv = [1.0, 0.0, 3.0]
sv = [1.0, 0.0, 3.0]
So I wrote the code below:
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
spark = SparkSession.builder.getOrCreate()
dv = Vectors.dense(1.0, 0.0, 3.0)
sv = Vectors.sparse(3, [(0,2), (1.,3.)])
Therefore, my first question is: is my syntax correct for achieving my goal?
My second question is: when I print them,
print(dv)
print(sv)
they return:
[1.0,0.0,3.0]
(3,[0,1],[2.0,3.0])
So, how do I show the "real" array of sv, i.e. not in this sparse Vectors form?
The creation of the sparse vector is slightly incorrect. From the docs, the second and third parameters should be
two sorted lists containing indices and values
This gives
sv = Vectors.sparse(3, [0,2], [1.,3])
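For a quick local check outside of a DataFrame, the vector's own toArray() method materializes the full array, including the implicit zeros:
print(sv.toArray())  # [1. 0. 3.]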
To transform the vectors into arrays inside a DataFrame, the function vector_to_array can be used:
from pyspark.sql import functions as F
from pyspark.ml.functions import vector_to_array
spark.createDataFrame([(dv,), (sv,)], ['col1']) \
.withColumn("as_array", vector_to_array(F.col('col1'))) \
.show(truncate=False)
prints
+-------------------+---------------+
|col1 |as_array |
+-------------------+---------------+
|[1.0,0.0,3.0] |[1.0, 0.0, 3.0]|
|(3,[0,2],[1.0,3.0])|[1.0, 0.0, 3.0]|
+-------------------+---------------+
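Note: vector_to_array lives in pyspark.ml.functions and, as far as I know, was added in Spark 3.0; on older versions, a UDF that calls toArray() on the vector achieves the same result.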
I am trying to run a regression in python. I have a list of lists (part of a bigger list) that looks something like this:
[[1307622004, 0.0, 339.093, 130.132],
[10562004, 0.0, 206.818, 62.111],
[127882004, 0.0, 994.624, 360.497],
[63702004, 0.0, 89.653, 19.103],
[655902004, 0.0, 199.613, 83.296],
[76482004, 0.0, 1891.0, 508.0],
[16332004, 0.0, 160.344, 25.446],
[294352004, 0.0, 67.115, 22.646],
[615922004, 0.0, 134.501, 41.01],
[1212572004, 0.0, 232.616, 5.086],
[658992004, 0.0, 189.155, 7.906],
[61962004, 0.0, 806.7, 164.1],
[121712004, 0.0, 1147.532, 271.014],
[1250142004, 0.0, 29.556, -5.721],
[148082004, 0.0, 22.05, -17.655]]
It looks like this because each inner list is a row from the CSV file I am importing the data from. From this point on, please view the elements of each list as columns of variables, to better understand what I am trying to run a regression on. For example, the first three lists would look something like this turned into columns (I do not need the variables turned into columns; I've done it for illustration purposes):
1307622004 0.0 339.093 130.132
10562004 0.0 206.818 62.111
127882004 0.0 994.624 360.497
To continue my example, I want the first column to be my dependent variable and all the other columns to be independent variables.
I have tried using numpy to transform the list into an array and then apply sklearn regression. Below is a code snippet:
Important to note: The list_of_lists contains many elements similar to the list I have provided at the beginning of the question.
from sklearn import datasets ## imports datasets from scikit-learn
from sklearn.linear_model import LinearRegression
import numpy as np
for item in list_of_lists:
test_array = np.asarray(item)
# print(test_array)
X, Y = test_array[:, 0], test_array[:, 1]
mdl = LinearRegression().fit(X, Y)
scores = LinearRegression.score(X, Y)
print('--------------------------')
The problem is that I get the following error:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I am quite new to Python and to the usage of arrays and matrices, so I don't really understand what is happening.
I'm not really sure why you're iterating through your list of lists; it is better to fit your regression to the whole array at once. (The reshape error itself appears because test_array[:, 0] is a 1-D array, while scikit-learn expects X to be 2-D with shape (n_samples, n_features).) Also, if you want the first column to be your response (dependent) variable and all the rest to be predictor (independent) variables, you need to change the definitions of X and Y, because as you have it, the first column is a predictor and the second column is the response:
test_array = np.asarray(list_of_lists)
# Set independent variables to be all columns after first, dependent to first col
X, Y = test_array[:, 1:], test_array[:, 0]
# Create a regressor
reg = LinearRegression()
# Fit it to your data
mdl = reg.fit(X, Y)
# Extract the R^2
scores = reg.score(X,Y)
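A quick sanity check of the fitted model on a hypothetical new observation (the three predictor values below are made up purely for illustration):
new_X = np.array([[0.0, 150.0, 40.0]])  # hypothetical predictor values
print(mdl.predict(new_X))               # predicted value of the first column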
This question already has an answer here: Spark ML VectorAssembler returns strange output (1 answer). Closed 5 years ago.
My Python version is 3.6.3 and my Spark version is 2.2.1. Here is my code:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
sc = SparkContext()
spark = SparkSession.builder.appName("Data Preprocessor") \
.config("spark.some.config.option", "1") \
.getOrCreate()
dataset = spark.createDataFrame(
    [(0, 59.0, 0.0, Vectors.dense([2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 9.0, 9.0, 9.0]), 1.0)],
    ["id", "hour", "mobile", "userFeatures", "clicked"])
assembler = VectorAssembler(inputCols=["hour", "mobile", "userFeatures"],
outputCol="features")
output = assembler.transform(dataset)
output.select("features").show(truncate=False)
Instead of getting a single dense vector, I am getting the following output:
(12,[0,2,9,10,11],[59.0,2.0,9.0,9.0,9.0])
The vector returned by VectorAssembler is in SparseVector form, so it still is a single vector, just printed in sparse notation.
12 is the number of features, [0,2,9,10,11] are the indices of the non-zero values, and [59.0,2.0,9.0,9.0,9.0] are the non-zero values themselves.
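If you prefer the dense representation, a small UDF can convert the column (a sketch; on Spark 2.2 the newer vector_to_array helper is not available yet):
from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT

# Densify the (possibly sparse) assembled vector column
to_dense = udf(lambda v: Vectors.dense(v.toArray()), VectorUDT())
output.withColumn("features_dense", to_dense("features")).show(truncate=False)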
I want to develop some python code to align datasets obtained by different instruments recording the same event.
As an example, say I have two sets of measurements:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Define some data
data1 = pd.DataFrame({'TIME':[1.1, 2.4, 3.2, 4.1, 5.3],\
'VALUE':[10.3, 10.5, 11.0, 10.9, 10.7],\
'ERROR':[0.2, 0.1, 0.4, 0.3, 0.2]})
data2 = pd.DataFrame({'TIME':[0.9, 2.1, 2.9, 4.2],\
'VALUE':[18.4, 18.7, 18.9, 18.8],\
'ERROR':[0.3, 0.2, 0.5, 0.4]})
# Plot the data
plt.errorbar(data1.TIME, data1.VALUE, yerr=data1.ERROR, fmt='ro')
plt.errorbar(data2.TIME, data2.VALUE, yerr=data2.ERROR, fmt='bo')
plt.show()
The resulting plot (not reproduced here) shows the two series at clearly different levels. What I would like to do now is to align the second dataset (data2) to the first one (data1), i.e. shift it onto the same level.
The second dataset must be shifted to match the first one by subtracting a constant (to be determined) from all its values. All I know is that the datasets are correlated since the two instruments are measuring the same event but with different sampling rates.
At this stage I do not want to make any assumptions about what function best describes the data (fitting will be done after alignment).
I am cautious about using means to perform shifts, since that may produce bad results depending on how the data is sampled. I was considering taking each data2[TIME_i], working out the shortest distance to data1[~TIME_i], and then minimizing the sum of those, but I am not sure that would work well either.
Does anyone have any suggestions on a good method to use? I looked at mlpy but it seems to only work on 1D arrays.
Thanks.
You can subtract the mean of the difference: data2.VALUE - (data2.VALUE - data1.VALUE).mean()
import pandas as pd
import matplotlib.pyplot as plt
# Define some data
data1 = pd.DataFrame({
'TIME': [1.1, 2.4, 3.2, 4.1, 5.3],
'VALUE': [10.3, 10.5, 11.0, 10.9, 10.7],
'ERROR': [0.2, 0.1, 0.4, 0.3, 0.2],
})
data2 = pd.DataFrame({
'TIME': [0.9, 2.1, 2.9, 4.2],
'VALUE': [18.4, 18.7, 18.9, 18.8],
'ERROR': [0.3, 0.2, 0.5, 0.4],
})
# Plot the data
plt.errorbar(data1.TIME, data1.VALUE, yerr=data1.ERROR, fmt='ro')
plt.errorbar(data2.TIME, data2.VALUE-(data2.VALUE - data1.VALUE).mean(),
yerr=data2.ERROR, fmt='bo')
plt.show()
Another possibility is to subtract the mean of each series, as in the sketch below.
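A minimal sketch of that approach, centering data2 on data1's mean level:
import matplotlib.pyplot as plt

# Shift data2 by the difference between the two series' own means
offset = data2.VALUE.mean() - data1.VALUE.mean()
plt.errorbar(data1.TIME, data1.VALUE, yerr=data1.ERROR, fmt='ro')
plt.errorbar(data2.TIME, data2.VALUE - offset, yerr=data2.ERROR, fmt='bo')
plt.show()
Unlike the mean-of-differences above, this variant does not rely on the two series having aligned indices, which matters here because data1 and data2 have different lengths.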
You can calculate the offset between the averages and subtract it from every value; if you do this for every value, the series should align relatively well. This assumes both datasets look relatively similar, so it might not work best in all cases.
Although this question is not Matlab related, you might still be interested in this:
Remove unknown DC Offset from a non-periodic discrete time signal