Does this computeSVD function use MapReduce in PySpark? - python

Does computeSVD() use map and reduce under the hood, since it is a predefined function? I couldn't find the source code of the function.
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.linalg.distributed import RowMatrix
rows = sc.parallelize([
    Vectors.sparse(5, {1: 1.0, 3: 7.0}),
    Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
    Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
])
mat = RowMatrix(rows)
# Compute the top 5 singular values and corresponding singular vectors.
svd = mat.computeSVD(5, computeU=True)  # <-- this function
U = svd.U  # The U factor is a RowMatrix.
s = svd.s  # The singular values are stored in a local dense vector.
V = svd.V  # The V factor is a local dense matrix.

It does. From the Spark documentation:
This page documents sections of the MLlib guide for the RDD-based API (the spark.mllib package). Please see the MLlib Main Guide for the DataFrame-based API (the spark.ml package), which is now the primary API for MLlib.
If you want to look at the code base, here it is: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala#L328
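As a rough sanity check from PySpark itself (a minimal sketch reusing the svd object from the code above, not part of the original answer): the U factor stays distributed as an RDD-backed RowMatrix, while s and V are small local objects collected to the driver.
print(type(svd.U))                    # <class 'pyspark.mllib.linalg.distributed.RowMatrix'>
print(svd.U.rows.getNumPartitions())  # U's rows live in a partitioned RDD on the cluster
print(svd.s)                          # local DenseVector of singular values
print(svd.V)                          # local DenseMatrix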

Related

Creating Vectors.dense and Vectors.sparse, are they identical?

I am trying to make more sense of these two types, so I am creating two vectors to see if I am doing it right. My goal is two identical arrays:
dv = [1.0, 0.0, 3.0]
sv = [1.0, 0.0, 3.0]
So I wrote the code below:
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
spark = SparkSession.builder.getOrCreate()
dv = Vectors.dense(1.0, 0.0, 3.0)
sv = Vectors.sparse(3, [(0,2), (1.,3.)])
Therefore, my first question is: is my syntax correct for achieving my goal?
My second question is: when I print them,
print(dv)
print(sv)
they return:
[1.0,0.0,3.0]
(3,[0,1],[2.0,3.0])
So, how do I show the "real" array of sv, i.e. the plain values rather than this (size, indices, values) form?
The creation of the sparse vector is slightly incorrect. From the docs, the second and third parameters should be
two sorted lists containing indices and values
This gives
sv = Vectors.sparse(3, [0,2], [1.,3])
To transform the vectors into arrays, the function vector_to_array can be used.
from pyspark.sql import functions as F
from pyspark.ml.functions import vector_to_array
spark.createDataFrame([(dv,), (sv,)], ['col1']) \
.withColumn("as_array", vector_to_array(F.col('col1'))) \
.show(truncate=False)
prints
+-------------------+---------------+
|col1 |as_array |
+-------------------+---------------+
|[1.0,0.0,3.0] |[1.0, 0.0, 3.0]|
|(3,[0,2],[1.0,3.0])|[1.0, 0.0, 3.0]|
+-------------------+---------------+
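If you only want to inspect the values locally, outside of a DataFrame, both pyspark.ml.linalg vector types also expose a toArray() method that returns a NumPy array; a minimal sketch reusing dv and sv from above:
print(dv.toArray())   # [1. 0. 3.]
print(sv.toArray())   # [1. 0. 3.]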

Pyspark Py4j IllegalArgumentException with spark.createDataFrame and pyspark.ml.clustering

Let me disclose the full background of my problem first; there is a simplified MWE that recreates the same issue at the bottom. Feel free to skip my rambling about my setup and go straight to the last section.
The Actors in my Original Problem:
A spark dataframe data read from Amazon S3, with a column scaled_features that ultimately is the result of a VectorAssembler operation followed by a MinMaxScaler.
A spark dataframe column pca_features that results from the above df column after a PCA like so:
mat = RowMatrix(data.select('scaled_features').rdd.map(list))
pc = mat.computePrincipalComponents(2)
projected = mat.multiply(pc).rows.map(lambda x: (x, )).toDF().withColumnRenamed('_1', 'pca_features')
Two instances of BisectingKMeans fitting to the feature columns of the two above-mentioned data frames, like so:
kmeans_scaled = BisectingKMeans(featuresCol='scaled_features').setK(4).setSeed(1)
model1 = kmeans_scaled.fit(data)
kmeans_pca = BisectingKMeans(featuresCol='pca_features').setK(4).setSeed(1)
model2 = kmeans_pca.fit(projected)
The Issue:
While BisectingKMeans fits to scaled_features from my first df without issues, when attempting a fit to the projected features, it errors out with the following
Py4JJavaError: An error occurred while calling o1413.fit.
: java.lang.IllegalArgumentException: requirement failed: Column features must be of type equal to one of the following types:
[struct<type:tinyint,size:int,indices:array<int>,values:array<double>>, array<double>, array<float>]
but was actually of type struct<type:tinyint,size:int,indices:array<int>,values:array<double>>.
As you can see, Py4J complains that I'm passing data in a certain struct type that happens to be the first type specified in the list of allowed types.
Additional Debug Info:
My Spark is running version 2.4.0
Checking the dtypes yields: data.dtypes: [('scaled_features', 'vector')] and projected.dtypes: [('pca_features', 'vector')]. The Schema is the same for both dataframes as well, printing just one for reference:
root
|-- scaled_features: vector (nullable = true)
Recreating the error (MWE):
It turns out that this same error can be recreated by creating a simple data frame from some Vectors (the columns in my original dfs are of VectorType as well):
from pyspark.sql import Row
from pyspark.mllib.linalg import DenseVector
from pyspark.ml.clustering import BisectingKMeans
test_data = spark.createDataFrame([
    Row(test_features=DenseVector([43.0, 0.0, 200.0, 1.0, 1.0, 1.0, 0.0, 3.0])),
    Row(test_features=DenseVector([44.0, 0.0, 250.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
    Row(test_features=DenseVector([23.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
    Row(test_features=DenseVector([25.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 2.0])),
    Row(test_features=DenseVector([19.0, 0.0, 200.0, 1.0, 0.0, 1.0, 0.0, 1.0]))])
kmeans_test = BisectingKMeans(featuresCol='test_features').setK(4).setSeed(1)
model3 = kmeans_test.fit(test_data)
The last line results in the same error I'm facing in my original setup.
Can anyone explain this error and suggest a way to rectify it?
After a few more days of investigation, I was pointed to the (rather embarrassing) cause of the issue:
PySpark has two machine learning libraries, pyspark.ml and pyspark.mllib, and it turns out they don't go well together. Replacing from pyspark.mllib.linalg import DenseVector with from pyspark.ml.linalg import DenseVector resolves all the issues.
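For completeness, a minimal sketch of the corrected MWE (only the import changes; it assumes an active SparkSession named spark, as in the question):
from pyspark.sql import Row
from pyspark.ml.linalg import DenseVector      # ml, not mllib
from pyspark.ml.clustering import BisectingKMeans

test_data = spark.createDataFrame([
    Row(test_features=DenseVector([43.0, 0.0, 200.0, 1.0, 1.0, 1.0, 0.0, 3.0])),
    Row(test_features=DenseVector([44.0, 0.0, 250.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
    Row(test_features=DenseVector([23.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0])),
    Row(test_features=DenseVector([25.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 2.0])),
    Row(test_features=DenseVector([19.0, 0.0, 200.0, 1.0, 0.0, 1.0, 0.0, 1.0]))])
kmeans_test = BisectingKMeans(featuresCol='test_features').setK(4).setSeed(1)
model3 = kmeans_test.fit(test_data)            # no longer raises the type error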

How to do a regression starting from a list of list of elements

I am trying to run a regression in python. I have a list of lists (part of a bigger list) that looks something like this:
[[1307622004, 0.0, 339.093, 130.132],
[10562004, 0.0, 206.818, 62.111],
[127882004, 0.0, 994.624, 360.497],
[63702004, 0.0, 89.653, 19.103],
[655902004, 0.0, 199.613, 83.296],
[76482004, 0.0, 1891.0, 508.0],
[16332004, 0.0, 160.344, 25.446],
[294352004, 0.0, 67.115, 22.646],
[615922004, 0.0, 134.501, 41.01],
[1212572004, 0.0, 232.616, 5.086],
[658992004, 0.0, 189.155, 7.906],
[61962004, 0.0, 806.7, 164.1],
[121712004, 0.0, 1147.532, 271.014],
[1250142004, 0.0, 29.556, -5.721],
[148082004, 0.0, 22.05, -17.655]]
It looks like this because each inner list is a row from a CSV file I am importing the data from. From this point on, please think of the elements of each list as columns of variables, to better understand what I am trying to run a regression on. For example, the first three lists would look something like this turned into columns (I do not need the variables turned into columns; I've done it for illustration purposes):
1307622004 0.0 339.093 130.132
10562004 0.0 206.818 62.111
127882004 0.0 994.624 360.497
To continue my example, I want the first column to be my dependent variable and all the other columns to be independent variables.
I have tried using numpy to transform the list into an array and then apply sklearn regression. Below is a code snippet:
Important to note: The list_of_lists contains many elements similar to the list I have provided at the beginning of the question.
from sklearn import datasets ## imports datasets from scikit-learn
from sklearn.linear_model import LinearRegression
import numpy as np
for item in list_of_lists:
    test_array = np.asarray(item)
    # print(test_array)
    X, Y = test_array[:, 0], test_array[:, 1]
    mdl = LinearRegression().fit(X, Y)
    scores = LinearRegression.score(X, Y)
    print('--------------------------')
The problem is that I get the following error:
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
I am quite new to Python and the usage of arrays and matrices in Python, so I don't really understand what is happening.
I'm not really sure why you're iterating through your list of lists; it is better to just fit your regression to the whole array. Also, if you want the first column to be your response (dependent) variable and all the rest to be predictor (independent) variables, you need to change the definitions of X and Y, because as written you have the first column as a predictor and the second column as the response:
test_array = np.asarray(list_of_lists)
# Set independent variables to be all columns after first, dependent to first col
X, Y = test_array[:, 1:], test_array[:, 0]
# Create a regressor
reg = LinearRegression()
# Fit it to your data
mdl = reg.fit(X, Y)
# Extract the R^2
scores = reg.score(X,Y)
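Put together, a minimal self-contained sketch using the first few sample rows from the question (the full list_of_lists would be used the same way):
import numpy as np
from sklearn.linear_model import LinearRegression

list_of_lists = [[1307622004, 0.0, 339.093, 130.132],
                 [10562004, 0.0, 206.818, 62.111],
                 [127882004, 0.0, 994.624, 360.497],
                 [63702004, 0.0, 89.653, 19.103]]

test_array = np.asarray(list_of_lists)
X, Y = test_array[:, 1:], test_array[:, 0]   # predictors: remaining columns, response: first column
reg = LinearRegression().fit(X, Y)
print(reg.coef_, reg.intercept_)             # fitted coefficients and intercept
print(reg.score(X, Y))                       # R^2 on the training data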

How to create Coulomb Matrix with Python?

I need some Coulomb matrices of molecules for a machine learning task.
Coulomb matrix? Here's a paper describing it.
I found the Python package molml, which has a method for it. However, I can't figure out how to use the API for a single molecule only. In all the examples they provide, the method is called with two molecules - why?
This is how the example calls the method:
H2 = (['H', 'H'],
      [[0.0, 0.0, 0.0],
       [1.0, 0.0, 0.0]])
HCN = (['H', 'C', 'N'],
       [[-1.0, 0.0, 0.0],
        [ 0.0, 0.0, 0.0],
        [ 1.0, 0.0, 0.0]])
feat.transform([H2, HCN])
I need something like this:
atomnames = [list of atomsymbols]
atomcoords = [list of [x,y,z] for the atoms]
coulombMatrice = CoulombMatrix((atomnames, atomcoords))
I also found another lib (QML) which promises the ability to generate Coulomb matrices, but I'm not able to install it on Windows because it depends on Linux gcc-fortran compilers; I already installed Cygwin and gcc-fortran for this purpose.
Thank you, guys
I've implemented my own solution for the problem. There's much room for improvement; e.g. the randomly sorted Coulomb matrix and bag of bonds are still not implemented.
import numpy as np

def get_coulombmatrix(molecule, largest_mol_size=None):
    """
    This function generates a Coulomb matrix for the given molecule.
    If largest_mol_size is provided, the matrix will have dimension lm x lm,
    padded with zeros at the bottom and on the right.
    """
    numberAtoms = len(molecule.atoms)
    if largest_mol_size is None or largest_mol_size == 0:
        largest_mol_size = numberAtoms

    cij = np.zeros((largest_mol_size, largest_mol_size))

    xyzmatrix = [[atom.position.x, atom.position.y, atom.position.z] for atom in molecule.atoms]
    chargearray = [atom.atomic_number for atom in molecule.atoms]

    for i in range(numberAtoms):
        for j in range(numberAtoms):
            if i == j:
                cij[i][j] = 0.5 * chargearray[i] ** 2.4  # Diagonal term: potential energy of the isolated atom
            else:
                dist = np.linalg.norm(np.array(xyzmatrix[i]) - np.array(xyzmatrix[j]))
                cij[i][j] = chargearray[i] * chargearray[j] / dist  # Pair-wise repulsion
    return cij
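The function assumes a molecule object with an atoms list whose entries expose atomic_number and position.x/y/z. A minimal usage sketch with hypothetical stand-in containers (the Position/Atom/Molecule names below are not from any library) building the matrix for H2:
from collections import namedtuple

Position = namedtuple('Position', ['x', 'y', 'z'])
Atom = namedtuple('Atom', ['atomic_number', 'position'])
Molecule = namedtuple('Molecule', ['atoms'])

h2 = Molecule(atoms=[Atom(1, Position(0.0, 0.0, 0.0)),
                     Atom(1, Position(0.74, 0.0, 0.0))])  # ~0.74 Angstrom H-H bond

print(get_coulombmatrix(h2, largest_mol_size=3))          # 2x2 Coulomb block padded to 3x3 with zeros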

How does MATLAB's ode15 handle integration of an equation containing a matrix of values?

I am translating MATLAB code into Python, but before worrying about the translation I would like to understand how MATLAB and specifically its ODE15s solver are interpreting an equation.
I have a function script, which is called upon in the master script, and this function script contains the equation:
function testFun=testFunction(t,f,dmat,releasevec)
testFun=(dmat*f)+(releasevec.');
Within testFunction, t refers to time, f to the value I am solving for, dmat to the matrix of constants I am curious about, and releasevec to a vector of additional constants.
The ODE15s solver in the master script works its magic with the following lines:
for i = 1:1461
    [~, f] = ode15s(@(t, f) testFunction(t, f, ...
        [dAremoval(i), dFWtoA(i), dSWtoA(i), dStoA(i), dFSedtoA(i), dSSedtoA(i); ...
         dAtoFW(i), dFWremoval(i), dSWtoFW(i), dStoFW(i), dFSedtoFW(i), dSSedtoFW(i); ...
         dAtoSW(i), dFWtoSW(i), dSWremoval(i), dStoSW(i), dFSedtoSW(i), dSSedtoSW(i); ...
         dAtoS(i), dFWtoS(i), dSWtoS(i), dSremoval(i), dFSedtoS(i), dSSedtoS(i); ...
         dAtoFSed(i), dFWtoFSed(i), dSWtoFSed(i), dStoFSed(i), dFSedremoval(i), dSSedtoFSed(i); ...
         dAtoSSed(i), dFWtoSSed(i), dSWtoSSed(i), dStoSSed(i), dFSedtoSSed(i), dSSedremoval(i)], ...
        [Arelease(i), FWrelease(i), SWrelease(i), Srelease(i), FSedrelease(i), SSedrelease(i)]), [i, i+1], fresults(:, i), options);
    fresults(:, i + 1) = f(end, :).';
end
fresults is a table of zeros (initially) that houses the f results. options is set via odeset to require nonnegative values. The d-values matrix above is a 6x6 matrix, and I already have all of the d values and release values calculated. My question is: how does ode15s perform the integration with a 6x6 matrix given in the testFunction equation? I have tried to solve this by hand, but have not been successful. Any help would be much appreciated!!
import numpy as np
from scipy.integrate import odeint

def func(y, t, params):
    f = y
    dmat, rvec = params
    derivs = [(dmat * f) + rvec]
    return derivs

# Parameters
dmat = np.array([[-1964977.10876756, 58831.976165, 39221.31744333, 1866923.81515922, 0.0, 0.0],
                 [58831.976165, -1.89800738e+09, 0.0, 1234.12447489, 21088.06180415, 14058.70786944],
                 [39221.31744333, 0.84352331, -7.59182852e+09, 0.0, 0.0, 0.0],
                 [1866923.81515922, 0.0, 0.0, -9.30598884e+08, 0.0, 0.0],
                 [0.0, 21088.10183616, 0.0, 0.0, -1.15427010e+09, 0.0],
                 [0.0, 0.0, 14058.73455744, 0.0, 0.0, -5.98519566e+09]], dtype=float)
new_d = np.ndarray.flatten(dmat)
rvec = np.array([[0.0], [0.0], [0.0], [0.0], [0.0], [0.0]])
f = 5.75e-16
# Initial conditions for ODE
y0 = f
# Parameters for ODE
params = [dmat, rvec]
# Times
tStart = 0.0
tStop = 2.0
tStep = 1.0
t = np.arange(tStart, tStop, tStep)
# Call the ODE solver
soln = odeint(func, y0, t, args=(params,))
# y = odeint(lambda y, t: func(y, t, params), y0, t)
It says here that ode15s uses backward differentiation formulas.
Your differential equation is (as far as I understand) f' = testFunc(t, f), and it involves some vector-matrix calculations inside the function.
Then you can replace the derivative by a backward difference formula, which in its simplest (first-order) form is:
f_next = f_prev + h*testFunc(t, f_next);
where f_prev holds the values from the previous step (initially, the initial condition vector). There is no important difference in the calculations just because testFunc(t, f) involves a 6x6 matrix. At each step the solver solves an implicit system to find f_next, building the Jacobian matrices numerically.
However, trying to code the algorithms exactly as MATLAB does may be harder than we think, since MATLAB applies some special treatments (optimization-related or not) to these problems. You should check each value you get carefully.
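Since the right-hand side here is linear, f' = dmat*f + rvec, a single backward-difference step can even be written out explicitly: the implicit equation f_next = f_prev + h*(dmat*f_next + rvec) reduces to the linear solve (I - h*dmat) f_next = f_prev + h*rvec. A minimal Python sketch of one such step (a simplification for illustration, not ode15s's actual variable-order algorithm):
import numpy as np

def backward_euler_step(f_prev, h, dmat, rvec):
    # Solve (I - h*dmat) f_next = f_prev + h*rvec for f_next
    n = len(f_prev)
    lhs = np.eye(n) - h * dmat
    rhs = f_prev + h * rvec
    return np.linalg.solve(lhs, rhs)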
Essentially, you need to change very few things. Use numpy.ndarray for the vectors and matrices. The time-stepping can be done using scipy.integrate.ode. You will need to re-initialize the integrator for every change in the ODE function, or supply the matrix and parameter vector as additional function arguments via set_f_params.
Closer to the MATLAB interface, but restricted to lsoda, is scipy.integrate.odeint. However, since you used a solver for stiff problems, this might be exactly what you need.
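As an illustration only (not from the original answers): newer SciPy versions also provide scipy.integrate.solve_ivp with method='BDF', which, like ode15s, targets stiff problems. A minimal sketch of one iteration of the loop, reusing the dmat and rvec arrays defined in the Python snippet above:
import numpy as np
from scipy.integrate import solve_ivp

def test_function(t, f, dmat, releasevec):
    # Same right-hand side as the MATLAB testFunction: dmat*f + releasevec
    return dmat @ f + releasevec

f0 = np.zeros(6)                                   # plays the role of fresults(:, i)
sol = solve_ivp(test_function, (0.0, 1.0), f0,
                method='BDF', args=(dmat, rvec.ravel()))
f_next = sol.y[:, -1]                              # plays the role of fresults(:, i + 1)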
