(py)Spark Parallelized Maximum Likelihood Calculation

I have two quick rookie questions on (py)Spark. I have a DataFrame as shown below, and I want to calculate the likelihood of the 'reading' column using scipy's multivariate_normal.pdf():
rdd_dat = spark.sparkContext.parallelize([(0, .12, "a"),(1, .45, "b"),(2, 1.01, "c"),(3, 1.2, "a"),
(4, .76, "a"),(5, .81, "c"),(6, 1.5, "b")])
df = rdd_dat.toDF(["id", "reading", "category"])
df.show()
+---+-------+--------+
| id|reading|category|
+---+-------+--------+
|  0|   0.12|       a|
|  1|   0.45|       b|
|  2|   1.01|       c|
|  3|    1.2|       a|
|  4|   0.76|       a|
|  5|   0.81|       c|
|  6|    1.5|       b|
+---+-------+--------+
This is my attempt using the UserDefinedFunction:
from scipy.stats import multivariate_normal
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DoubleType
mle = UserDefinedFunction(multivariate_normal.pdf, DoubleType())
mean =1
cov=1
df_with_mle = df.withColumn("MLE", mle(df['reading']))
This runs without throwing an error, but when I want to look at the resulting df_with_mle, I get the error below:
df_with_mle.show()
An error occurred while calling o149.showString.
1) Any idea why I am getting this error?
2) If I wanted to specify the mean and cov, like: df.withColumn("MLE", mle(df['reading'], 1, 1)), how can I do this?

The multivariate_normal.pdf() method from scipy expects to receive a series. A column from a pandas DataFrame is a series, but a column from a PySpark DataFrame is a different kind of object (a pyspark.sql.column.Column), which SciPy doesn't know how to handle.
Also, although this won't keep your function call from running, your UDF definition never passes the parameters: mean and cov are just integer variables until you pass them as arguments and override the defaults (mean=0, cov=1, per the SciPy documentation):
multivariate_normal.pdf(x=df['reading'], mean=mean,cov=cov)
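One way to pass mean and cov through a UDF is to bind them in a closure, so that Spark only has to supply the reading. The sketch below is an illustration, not the answerer's code: make_likelihood and normal_pdf are hypothetical names, and normal_pdf is a standard-library stand-in for scipy's multivariate_normal.pdf in the univariate case, so the block runs without Spark or SciPy. It assumes that a Python UDF is called once per row with a plain value, and it converts the result with float(), because NumPy scalar return values are a common source of errors in DoubleType UDFs.

```python
import math


def make_likelihood(pdf, mean, cov):
    """Bind mean and cov so the returned function takes only the reading."""
    def likelihood(x):
        if x is None:  # Spark passes nulls to Python UDFs as None
            return None
        # float() turns a NumPy scalar into a plain Python float
        return float(pdf(x, mean=mean, cov=cov))
    return likelihood


def normal_pdf(x, mean=0.0, cov=1.0):
    """Univariate normal density with variance cov (stand-in for SciPy)."""
    return math.exp(-((x - mean) ** 2) / (2.0 * cov)) / math.sqrt(2.0 * math.pi * cov)


likelihood = make_likelihood(normal_pdf, mean=1, cov=1)
print(likelihood(1.01))

# With Spark and SciPy available, the same wrapper could be used as:
#   from scipy.stats import multivariate_normal
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import DoubleType
#   mle = udf(make_likelihood(multivariate_normal.pdf, 1, 1), DoubleType())
#   df.withColumn("MLE", mle(df["reading"]))
```

Binding the parameters in a closure keeps the UDF call down to a single column argument, which is all withColumn needs to provide.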

Related

How to extract all elements after last underscore in pyspark?

I have a pyspark dataframe with a column I am trying to extract information from. To give you an example, the column is a combination of 4 foreign keys which could look like this:
Ex 1: 12345-123-12345-4
Ex 2: 5678-4321-123-12
I am trying to extract the last piece of the string, in this case the 4 & 12. Any idea on how I can do this?
I've tried the following:
df.withColumn("result", sf.split(sf.col("column_to_split"), '\_')[1])\
.withColumn("result", sf.col("result").cast('integer'))
However, the result for double digit values is null, and it only returns an integer for single digits (0-9)
Thanks!

For Spark 2.4 and later, you should use element_at with index -1 on the array returned by split:
from pyspark.sql import functions as sf
df.withColumn("result", sf.element_at(sf.split("column_to_split", "-"), -1).cast("int")).show()
+-----------------+------+
|  column_to_split|result|
+-----------------+------+
|12345-123-12345-4|     4|
| 5678-4321-123-12|    12|
+-----------------+------+

Mohammad's answer is very clean and a nice solution. However, if you need a solution for Spark versions < 2.4, you can reverse the string, take the first element after splitting, reverse it back, and cast it to an integer, for example:
import pandas as pd
import pyspark.sql.functions as f
import pyspark.sql.types as t
df = pd.DataFrame()
df['column_to_split'] = ["12345-123-12345-4", "5678-4321-123-12"]
df = spark.createDataFrame(df)
df.withColumn("result",
f.reverse(f.split(f.reverse("column_to_split"), "-")[0]). \
cast(t.IntegerType())).show(2, False)
+-----------------+------+
|column_to_split  |result|
+-----------------+------+
|12345-123-12345-4|4     |
|5678-4321-123-12 |12    |
+-----------------+------+

This is how to get the last digits of the serial number above:
serial_no = '12345-123-12345-4'
last_digit = serial_no.split('-')[-1]
print(last_digit)
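So in your case, a Column has no Python split method, so use Spark's own string functions to do the same thing, for example:
df.withColumn("result", sf.substring_index(sf.col("column_to_split"), "-", -1).cast("int"))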
If it doesn't work, please share the result.

Adding a couple of other ways:
You can also use the regexp_extract() or substring_index() functions:
Example:
from pyspark.sql.functions import col, regexp_extract, substring_index
df.show()
#+-----------------+
#|  column_to_split|
#+-----------------+
#|12345-123-12345-4|
#| 5678-4321-123-12|
#+-----------------+
df.withColumn("result",regexp_extract(col("column_to_split"),"([^-]+$)",1).cast("int")).\
withColumn("result1",substring_index(col("column_to_split"),"-",-1).cast("int")).\
show()
#+-----------------+------+-------+
#|  column_to_split|result|result1|
#+-----------------+------+-------+
#|12345-123-12345-4|     4|      4|
#| 5678-4321-123-12|    12|     12|
#+-----------------+------+-------+
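The logic behind the two functions can be checked in plain Python before running it on a cluster. This is only a rough analogue: last_token_regex and last_token_split are hypothetical helper names, and they imitate the pattern "([^-]+$)" and the substring_index(..., "-", -1) call on ordinary strings rather than on Spark columns.

```python
import re


def last_token_regex(value):
    """Final run of non-hyphen characters, as matched by "([^-]+$)"."""
    match = re.search(r"([^-]+$)", value)
    return match.group(1) if match else None


def last_token_split(value, sep="-"):
    """Everything after the last separator, like substring_index with count -1."""
    return value.rsplit(sep, 1)[-1]


for key in ["12345-123-12345-4", "5678-4321-123-12"]:
    print(key, int(last_token_regex(key)), int(last_token_split(key)))
```

Both helpers look at the end of the string only, which is why they are not affected by how many keys precede the last one.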

PySpark 2.2.0 : 'numpy.ndarray' object has no attribute 'indices'

Task
I'm calculating the size of the indices within a SparseVector using the Python API for Spark (PySpark).
Script
def score_clustering(dataframe):
    assembler = VectorAssembler(inputCols = dataframe.drop("documento").columns, outputCol = "variables")
    data_transformed = assembler.transform(dataframe)
    data_transformed_rdd = data_transformed.select("documento", "variables").orderBy(data_transformed.documento.asc()).rdd
    count_variables = data_transformed_rdd.map(lambda row : [row[0], row[1].indices.size]).toDF(["id", "frequency"])
Issue
When I execute the action .count() on the count_variables DataFrame, an error shows up:
AttributeError: 'numpy.ndarray' object has no attribute 'indices'
The main part to consider is:
data_transformed_rdd.map(lambda row : [row[0], row[1].indices.size]).toDF(["id", "frequency"])
I believe this chunk has to do with the error, but I cannot understand why the exception mentions numpy.ndarray when I'm doing the calculation by mapping a lambda expression that takes a SparseVector (created with the assembler) as its argument.
Any suggestions? Does anyone maybe know what I'm doing wrong?

There are two problems here. The first is the indices.size call. indices and size are two different attributes of the SparseVector class: size is the full length of the vector, and indices holds the positions of the non-zero values. What you want is the number of entries in indices, not the size of the vector. So, assuming that all your vectors are instances of the SparseVector class:
from pyspark.ml.linalg import Vectors
df = spark.createDataFrame([(0, Vectors.sparse(4, [0, 1], [11.0, 2.0])),
(1, Vectors.sparse(4, [], [])),
(3, Vectors.sparse(4, [0,1,2], [2.0, 2.0, 2.0]))],
["documento", "variables"])
df.show()
+---------+--------------------+
|documento|           variables|
+---------+--------------------+
|        0|(4,[0,1],[11.0,2.0])|
|        1|           (4,[],[])|
|        3|(4,[0,1,2],[2.0,2...|
+---------+--------------------+
The solution is the len function:
df = df.rdd.map(lambda x: (x[0], x[1], len(x[1].indices)))\
.toDF(["documento", "variables", "frecuencia"])
df.show()
+---------+--------------------+----------+
|documento|           variables|frecuencia|
+---------+--------------------+----------+
|        0|(4,[0,1],[11.0,2.0])|         2|
|        1|           (4,[],[])|         0|
|        3|(4,[0,1,2],[2.0,2...|         3|
+---------+--------------------+----------+
And here comes the second problem: VectorAssembler does not always generate SparseVectors. Depending on which is more efficient, it can generate either a SparseVector or a DenseVector (based on the number of zeros in the original vector). For example, consider the following DataFrame:
df = spark.createDataFrame([(0, Vectors.sparse(4, [0, 1], [11.0, 2.0])),
(1, Vectors.dense([1., 1., 1., 1.])),
(3, Vectors.sparse(4, [0,1,2], [2.0, 2.0, 2.0]))],
["documento", "variables"])
df.show()
+---------+--------------------+
|documento|           variables|
+---------+--------------------+
|        0|(4,[0,1],[11.0,2.0])|
|        1|   [1.0,1.0,1.0,1.0]|
|        3|(4,[0,1,2],[2.0,2...|
+---------+--------------------+
Document 1 is a DenseVector, and the previous solution does not work because a DenseVector has no indices attribute (this is consistent with the AttributeError in the question). To work with a DataFrame that contains both sparse and dense vectors, you have to use a more general representation of the vectors, for example NumPy:
import numpy as np
df = df.rdd.map(lambda x: (x[0],
x[1],
np.nonzero(x[1])[0].size))\
.toDF(["documento", "variables", "frecuencia"])
df.show()
+---------+--------------------+----------+
|documento|           variables|frecuencia|
+---------+--------------------+----------+
|        0|(4,[0,1],[11.0,2.0])|         2|
|        1|   [1.0,1.0,1.0,1.0]|         4|
|        3|(4,[0,1,2],[2.0,2...|         3|
+---------+--------------------+----------+

How to run exponential weighted moving average in pyspark

I am trying to run exponential weighted moving average in PySpark using a Grouped Map Pandas UDF. It doesn't work though:
def ExpMA(myData):
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.functions import PandasUDFType
    from pyspark.sql import SQLContext
    df = myData
    group_col = 'Name'
    sort_col = 'Date'
    schema = df.select(group_col, sort_col,'count').schema
    print(schema)
    @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
    def ema(pdf):
        Model = pd.DataFrame(pdf.apply(lambda x: x['count'].ewm(span=5, min_periods=1).mean()))
        return Model
    data = df.groupby('Name').apply(ema)
    return data
I also tried running it without the Pandas udf, just writing the ewma equation in PySpark, but the problem there is that the ewma equation contains the lag of the current ewma.

First of all, your Pandas code is incorrect. This just won't work, Spark or not:
pdf.apply(lambda x: x['count'].ewm(span=5, min_periods=1).mean())
Another problem is the output schema, which, depending on your data, won't really accommodate the result:
If you want to add ewm, the schema should be extended.
If you want to return only ewm, then the schema is too large.
If you want to just replace, it might not match the type.
Let's assume this is the first scenario (I allowed myself to rewrite your code a bit):
from pyspark.sql.functions import pandas_udf
from pyspark.sql.functions import PandasUDFType
from pyspark.sql.types import DoubleType, StructField
def exp_ma(df, group_col='Name', sort_col='Date'):
    # Output schema: the selected input columns plus the new ewma column
    schema = (df.select(group_col, sort_col, 'count')
        .schema.add(StructField('ewma', DoubleType())))
    @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
    def ema(pdf):
        # Name the new column 'ewma' so it matches the schema declared above
        pdf['ewma'] = pdf['count'].ewm(span=5, min_periods=1).mean()
        return pdf
    return df.groupby(group_col).apply(ema)
df = spark.createDataFrame(
[("a", 1, 1), ("a", 2, 3), ("a", 3, 3), ("b", 1, 10), ("b", 8, 3), ("b", 9, 0)],
("name", "date", "count")
)
exp_ma(df).show()
# +----+----+-----+------------------+
# |Name|Date|count|              ewma|
# +----+----+-----+------------------+
# |   b|   1|   10|              10.0|
# |   b|   8|    3| 5.800000000000001|
# |   b|   9|    0|3.0526315789473686|
# |   a|   1|    1|               1.0|
# |   a|   2|    3|               2.2|
# |   a|   3|    3| 2.578947368421052|
# +----+----+-----+------------------+
I don't use Pandas much, so there might be a more elegant way of doing this.
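The numbers in the ewma column can be reproduced by hand, which also shows why the calculation depends on the previous step. The sketch below is an illustration in plain Python, not part of the answer: ewma_adjusted is a hypothetical name, and it assumes pandas' default adjust=True weighting, where observation i steps back gets weight (1 - alpha)**i with alpha = 2 / (span + 1).

```python
def ewma_adjusted(values, span):
    """Exponentially weighted mean with weights (1 - alpha)**i, alpha = 2 / (span + 1)."""
    alpha = 2.0 / (span + 1.0)
    decay = 1.0 - alpha
    result = []
    weighted_sum = 0.0  # running sum of decayed observations
    weight_total = 0.0  # running sum of the weights themselves
    for x in values:
        # Each step only needs the two running sums from the previous step
        weighted_sum = x + decay * weighted_sum
        weight_total = 1.0 + decay * weight_total
        result.append(weighted_sum / weight_total)
    return result


print(ewma_adjusted([10, 3, 0], span=5))  # group "b"
print(ewma_adjusted([1, 3, 3], span=5))   # group "a"
```

Because each value depends on running sums over the ordered group, the computation is sequential within a group, which is why a grouped pandas UDF is a natural fit for it.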

How to replace infinity in PySpark DataFrame

It seems like there is no support for replacing infinity values. I tried the code below and it doesn't work. Or am I missing something?
a=sqlContext.createDataFrame([(None, None), (1, np.inf), (None, 2)])
a.replace(np.inf, 10)
Or do I have to take the painful route: convert the PySpark DataFrame into a pandas DataFrame, replace the infinity values, and convert it back to a PySpark DataFrame?

"It seems like there is no support for replacing infinity values."
Actually, it looks like a Py4J bug, not an issue with replace itself. See "Support nan/inf between Python and Java".
As a workaround, you can try either a UDF (the slow option):
import numpy as np
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import col, lit, udf, when
df = sc.parallelize([(None, None), (1.0, np.inf), (None, 2.0)]).toDF(["x", "y"])
replace_infs_udf = udf(
lambda x, v: float(v) if x and np.isinf(x) else x, DoubleType()
)
df.withColumn("x1", replace_infs_udf(col("y"), lit(-99.0))).show()
## +----+--------+-----+
## |   x|       y|   x1|
## +----+--------+-----+
## |null|    null| null|
## | 1.0|Infinity|-99.0|
## |null|     2.0|  2.0|
## +----+--------+-----+
or an expression like this:
def replace_infs(c, v):
    is_infinite = c.isin([
        lit("+Infinity").cast("double"),
        lit("-Infinity").cast("double")
    ])
    return when(c.isNotNull() & is_infinite, v).otherwise(c)
df.withColumn("x1", replace_infs(col("y"), lit(-99))).show()
## +----+--------+-----+
## |   x|       y|   x1|
## +----+--------+-----+
## |null|    null| null|
## | 1.0|Infinity|-99.0|
## |null|     2.0|  2.0|
## +----+--------+-----+
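Both workarounds apply the same per-value rule: leave nulls and finite numbers alone, and substitute the replacement for positive or negative infinity. A minimal plain-Python sketch of that rule, using the hypothetical name replace_inf and the standard library only, is:

```python
import math


def replace_inf(value, replacement):
    """Return replacement for +inf/-inf; leave None and finite values untouched."""
    if value is not None and math.isinf(value):
        return float(replacement)
    return value


rows = [(None, None), (1.0, float("inf")), (None, 2.0)]
print([replace_inf(y, -99.0) for _, y in rows])
```

The explicit "is not None" test mirrors the isNotNull() guard in the column expression and avoids relying on the truthiness of the value.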

Spark withColumn() performing power functions

I have a data frame df with columns "col1" and "col2". I want to create a third column which uses one of the columns as the exponent in a power function.
df = df.withColumn("col3", 100**(df("col1")))*df("col2")
However, this always results in:
TypeError: unsupported operand type(s) for ** or pow(): 'float' and 'Column'
I understand that this is due to the function taking df("col1") as a "Column" instead of the item at that row.
If I perform
results = df.map(lambda x : 100**(df("col2"))*df("col2"))
this works, but I can't append to my original data frame.
Any thoughts?
This is my first time posting, so I apologize for any formatting problems.

Since Spark 1.4 you can use the pow function as follows:
from pyspark.sql import Row
from pyspark.sql.functions import pow, col
row = Row("col1", "col2")
df = sc.parallelize([row(1, 2), row(2, 3), row(3, 3)]).toDF()
df.select("*", pow(col("col1"), col("col2")).alias("pow")).show()
## +----+----+----+
## |col1|col2| pow|
## +----+----+----+
## |   1|   2| 1.0|
## |   2|   3| 8.0|
## |   3|   3|27.0|
## +----+----+----+
If you use an older version, a Python UDF should do the trick:
import math
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
my_pow = udf(lambda x, y: math.pow(x, y), DoubleType())

Just to complement the accepted answer: one can now do something very similar to what the OP tried to do, i.e., use the ** operator, or even Python's builtin pow:
from pyspark.sql import SparkSession
from pyspark.sql.functions import pow as pow_
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, ), (2, ), (3, ), (4, ), (5, ), (6, )], 'n: int')
df = df.withColumn('pyspark_pow', pow_(df['n'], df['n'])) \
.withColumn('python_pow', pow(df['n'], df['n'])) \
.withColumn('double_star_operator', df['n'] ** df['n'])
df.show()
+---+-----------+----------+--------------------+
|  n|pyspark_pow|python_pow|double_star_operator|
+---+-----------+----------+--------------------+
|  1|        1.0|       1.0|                 1.0|
|  2|        4.0|       4.0|                 4.0|
|  3|       27.0|      27.0|                27.0|
|  4|      256.0|     256.0|               256.0|
|  5|     3125.0|    3125.0|              3125.0|
|  6|    46656.0|   46656.0|             46656.0|
+---+-----------+----------+--------------------+
As one can see, both PySpark's and Python's pow return the same result, as well as the ** operator. It also works when one of the arguments is a scalar:
df = df.withColumn('pyspark_pow', pow_(2, df['n'])) \
.withColumn('python_pow', pow(2, df['n'])) \
.withColumn('double_star_operator', 2 ** df['n'])
df.show()
+---+-----------+----------+--------------------+
|  n|pyspark_pow|python_pow|double_star_operator|
+---+-----------+----------+--------------------+
|  1|        2.0|       2.0|                 2.0|
|  2|        4.0|       4.0|                 4.0|
|  3|        8.0|       8.0|                 8.0|
|  4|       16.0|      16.0|                16.0|
|  5|       32.0|      32.0|                32.0|
|  6|       64.0|      64.0|                64.0|
+---+-----------+----------+--------------------+
I believe the reason Python's pow now works on PySpark columns is that pow is equivalent to the ** operator when used with only two arguments (see docs, here), and the ** operator uses the object's own implementation of the power operation, if it is defined for the object being operated on (see this SO response here).
Apparently, PySpark's Column has the proper definitions for __pow__ operator (see source code for Column).
I am not sure why the ** operator did not work originally, but I am assuming it is related to the fact that - at the time - Column was defined differently.
The stack used for testing was Python 3.8.5 and PySpark 3.1.1, but I have seen this behavior for PySpark >= 2.4 as well.
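The dispatch mechanism described above can be seen with a toy class. This is a sketch under stated assumptions: Expr and NoRPow are hypothetical stand-ins that only record a formula as text, and they are not PySpark's Column. The last part shows that a class without a reflected __rpow__ raises a TypeError when it appears on the right of **, which may be related to the error in the question, although the source does not confirm that.

```python
def _text(value):
    """Render an operand: the recorded formula for Expr, repr() otherwise."""
    return value.text if isinstance(value, Expr) else repr(value)


class Expr:
    """Toy stand-in for a column expression; it only records the formula."""
    def __init__(self, text):
        self.text = text

    def __pow__(self, other):
        # Used for expr ** other and pow(expr, other)
        return Expr("POWER({}, {})".format(self.text, _text(other)))

    def __rpow__(self, other):
        # Used for other ** expr when other (e.g. an int) cannot handle Expr
        return Expr("POWER({}, {})".format(_text(other), self.text))


class NoRPow:
    """Defines __pow__ only, so it cannot be the right-hand operand of **."""
    def __pow__(self, other):
        return self


n = Expr("n")
print((n ** n).text)
print(pow(2, n).text)
print((2 ** n).text)

try:
    100.0 ** NoRPow()
except TypeError as exc:
    print("TypeError:", exc)
```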
