How to replace infinity in PySpark DataFrame

It seems like there is no support for replacing infinity values. I tried the code below and it doesn't work. Or am I missing something?
import numpy as np

a = sqlContext.createDataFrame([(None, None), (1, np.inf), (None, 2)])
a.replace(np.inf, 10)
Or do I have to take the painful route: convert the PySpark DataFrame into a pandas DataFrame, replace the infinity values, and convert it back to a PySpark DataFrame?

It seems like there is no support for replacing infinity values.
Actually it looks like a Py4J bug, not an issue with replace itself. See Support nan/inf between Python and Java.
As a workaround, you can try either a UDF (slow option):
import numpy as np
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import col, lit, udf, when

df = sc.parallelize([(None, None), (1.0, np.inf), (None, 2.0)]).toDF(["x", "y"])

replace_infs_udf = udf(
    lambda x, v: float(v) if x and np.isinf(x) else x, DoubleType()
)

df.withColumn("x1", replace_infs_udf(col("y"), lit(-99.0))).show()
## +----+--------+-----+
## | x| y| x1|
## +----+--------+-----+
## |null|    null| null|
## | 1.0|Infinity|-99.0|
## |null|     2.0|  2.0|
## +----+--------+-----+
or an expression like this:
def replace_infs(c, v):
    is_infinite = c.isin([
        lit("+Infinity").cast("double"),
        lit("-Infinity").cast("double")
    ])
    return when(c.isNotNull() & is_infinite, v).otherwise(c)
df.withColumn("x1", replace_infs(col("y"), lit(-99))).show()
## +----+--------+-----+
## | x| y| x1|
## +----+--------+-----+
## |null|    null| null|
## | 1.0|Infinity|-99.0|
## |null|     2.0|  2.0|
## +----+--------+-----+
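If several numeric columns can contain infinities, the same expression can be applied to each of them in one select. A minimal sketch building on replace_infs above; the numeric_cols list is hypothetical:
numeric_cols = ["x", "y"]  # hypothetical: the columns that may hold +/-Infinity

df.select(*[
    replace_infs(col(c), lit(-99.0)).alias(c) if c in numeric_cols else col(c)
    for c in df.columns
]).show()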

Related

How to detect monotonic decrease in pyspark

I am working with a Spark DataFrame where I would like to detect any value in a specific column that does not monotonically decrease. For those values, I would like to replace them with the previous value according to the ordering criteria.
Here is a conceptual example: if I have a column with the values [65, 66, 62, 100, 40], the value 100 does not follow the monotonic decrease trend and therefore should be replaced by 62, so the resulting list will be [65, 66, 62, 62, 40].
Below is some code I created to detect the values that must be replaced; however, I don't know how to replace each one with the previous value, nor how to ignore the initial null produced by the lag.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as psf
from pyspark.sql.window import Window
sc = SparkContext(appName="sample-app")
sqlc = SQLContext(sc)
rdd = sc.parallelize([(1, 65), (2, 66), (3, 62), (4, 100), (5, 40)])
df = sqlc.createDataFrame(rdd, ["id", "value"])
window = Window.orderBy(df.id).rowsBetween(-1, -1)
sdf = df.withColumn(
    "__monotonic_col",
    (df.value <= psf.lag(df.value, 1).over(window)) & df.value.isNotNull(),
)
sdf.show()
This code produces the following output:
+---+-----+---------------+
| id|value|__monotonic_col|
+---+-----+---------------+
|  1|   65|           null|
|  2|   66|          false|
|  3|   62|           true|
|  4|  100|          false|
|  5|   40|           true|
+---+-----+---------------+
Firstly, if my understanding is correct, shouldn't the 66 also be replaced (by 65) as it does not follow the decreasing trend?
If that is the correct interpretation, then the following should work (I have added an extra column to keep things tidy, but you could wrap everything into a single column creation statement):
from pyspark.sql import functions as F
sdf = sdf.withColumn(
    "__monotonic_col_value",
    F.when(
        F.col("__monotonic_col") | F.col("__monotonic_col").isNull(), df.value
    ).otherwise(
        F.lag(df.value, 1).over(window)
    ),
)
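For reference, here is a rough sketch of the single-statement version mentioned above, using the same df and window as in the question (the value_fixed column name is hypothetical):
sdf = df.withColumn(
    "value_fixed",
    F.when(
        F.lag(df.value, 1).over(window).isNull()          # first row: nothing to compare against
        | (df.value <= F.lag(df.value, 1).over(window)),  # still non-increasing, keep the value
        df.value,
    ).otherwise(F.lag(df.value, 1).over(window)),         # otherwise fall back to the previous value
)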

How to run exponential weighted moving average in pyspark

I am trying to run an exponential weighted moving average in PySpark using a Grouped Map Pandas UDF. It doesn't work though:
def ExpMA(myData):
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.functions import PandasUDFType
    from pyspark.sql import SQLContext

    df = myData
    group_col = 'Name'
    sort_col = 'Date'

    schema = df.select(group_col, sort_col, 'count').schema
    print(schema)

    @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
    def ema(pdf):
        Model = pd.DataFrame(pdf.apply(lambda x: x['count'].ewm(span=5, min_periods=1).mean()))
        return Model

    data = df.groupby('Name').apply(ema)
    return data
I also tried running it without the Pandas UDF, just writing the ewma equation directly in PySpark, but the problem there is that the ewma equation is recursive: it contains the lag of the ewma itself.
First of all, your Pandas code is incorrect. This just won't work, Spark or not:
pdf.apply(lambda x: x['count'].ewm(span=5, min_periods=1).mean())
Another problem is the output schema which, depending on your data, won't really accommodate the result:
If you want to add the ewm column, the schema should be extended.
If you want to return only the ewm column, the schema is too large.
If you want to just replace the existing values, the type might not match.
Let's assume this is the first scenario (I allowed myself to rewrite your code a bit):
from pyspark.sql.functions import pandas_udf
from pyspark.sql.functions import PandasUDFType
from pyspark.sql.types import DoubleType, StructField

def exp_ma(df, group_col='Name', sort_col='Date'):
    schema = (df.select(group_col, sort_col, 'count')
              .schema.add(StructField('ewma', DoubleType())))

    @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
    def ema(pdf):
        pdf['ewm'] = pdf['count'].ewm(span=5, min_periods=1).mean()
        return pdf

    return df.groupby('Name').apply(ema)
df = spark.createDataFrame(
    [("a", 1, 1), ("a", 2, 3), ("a", 3, 3), ("b", 1, 10), ("b", 8, 3), ("b", 9, 0)],
    ("name", "date", "count")
)
exp_ma(df).show()
# +----+----+-----+------------------+
# |Name|Date|count|              ewma|
# +----+----+-----+------------------+
# |   b|   1|   10|              10.0|
# |   b|   8|    3| 5.800000000000001|
# |   b|   9|    0|3.0526315789473686|
# |   a|   1|    1|               1.0|
# |   a|   2|    3|               2.2|
# |   a|   3|    3| 2.578947368421052|
# +----+----+-----+------------------+
I don't use Pandas much, so there might be a more elegant way of doing this.
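As a side note, on Spark 3.x the same grouped logic is usually written with GroupedData.applyInPandas instead of the GROUPED_MAP pandas_udf decorator. A rough sketch, reusing the imports, df, and ewm logic from above (the exp_ma_schema name is hypothetical):
# Sketch for Spark 3.x: pass a plain Python function plus the output schema to applyInPandas.
def ema(pdf):
    pdf['ewm'] = pdf['count'].ewm(span=5, min_periods=1).mean()
    return pdf

exp_ma_schema = (df.select('name', 'date', 'count')
                 .schema.add(StructField('ewma', DoubleType())))

df.groupby('name').applyInPandas(ema, schema=exp_ma_schema).show()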

(py)Spark Parallelized Maximum Likelihood Calculation

I have two quick rookie questions on (py)Spark. I have a DataFrame as below, and I want to calculate the likelihood of the 'reading' column using scipy's multivariate_normal.pdf().
rdd_dat = spark.sparkContext.parallelize([(0, .12, "a"), (1, .45, "b"), (2, 1.01, "c"), (3, 1.2, "a"),
                                          (4, .76, "a"), (5, .81, "c"), (6, 1.5, "b")])
df = rdd_dat.toDF(["id", "reading", "category"])
df.show()
+---+-------+--------+
| id|reading|category|
+---+-------+--------+
|  0|   0.12|       a|
|  1|   0.45|       b|
|  2|   1.01|       c|
|  3|    1.2|       a|
|  4|   0.76|       a|
|  5|   0.81|       c|
|  6|    1.5|       b|
+---+-------+--------+
This is my attempt using the UserDefinedFunction:
from scipy.stats import multivariate_normal
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DoubleType
mle = UserDefinedFunction(multivariate_normal.pdf, DoubleType())
mean =1
cov=1
df_with_mle = df.withColumn("MLE", mle(df['reading']))
This runs without throwing an error, but when I want to look at the resulting df_with_mle, I get the error below:
df_with_mle.show()
An error occurred while calling o149.showString.
1) Any idea why I am getting this error?
2) If I wanted to specify the mean and cov, like: df.withColumn("MLE", mle(df['reading'], 1, 1)), how can I do this?
The multivariate_normal.pdf() method from scipy expects to receive a Series. A column from a pandas DataFrame is a Series, but a column from a PySpark DataFrame is a different kind of object (a pyspark.sql.column.Column), which scipy doesn't know how to handle.
Also, and this won't keep your function call from running, your function definition ends without specifying the parameters: cov and mean aren't defined explicitly in the API unless they occur within the method call. mean and cov are just integer objects until you set them as parameters and override the defaults (mean=0, cov=1, per the scipy documentation):
multivariate_normal.pdf(x=df['reading'], mean=mean, cov=cov)
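To address the second question, one workaround (not from the original answer) is to close over mean and cov in a plain Python function, cast the numpy result to a Python float, and wrap that with udf:
from scipy.stats import multivariate_normal
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

mean, cov = 1, 1

# Wrap pdf so the fixed mean/cov are baked in and the numpy float is cast to a Python float.
mle = udf(lambda x: float(multivariate_normal.pdf(x, mean=mean, cov=cov)), DoubleType())

df_with_mle = df.withColumn("MLE", mle(df["reading"]))
df_with_mle.show()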

Spark withColumn() performing power functions

I have a data frame df with columns "col1" and "col2". I want to create a third column which uses one of the columns as the exponent in a power function.
df = df.withColumn("col3", 100**(df("col1")))*df("col2")
However, this always results in:
TypeError: unsupported operand type(s) for ** or pow(): 'float' and 'Column'
I understand that this is due to the function taking df("col1") as a "Column" instead of the item at that row.
If I perform
results = df.map(lambda x : 100**(df("col2"))*df("col2"))
this works, but I can't append to my original data frame.
Any thoughts?
This is my first time posting, so I apologize for any formatting problems.
Since Spark 1.4 you can use the pow function as follows:
from pyspark.sql import Row
from pyspark.sql.functions import pow, col
row = Row("col1", "col2")
df = sc.parallelize([row(1, 2), row(2, 3), row(3, 3)]).toDF()
df.select("*", pow(col("col1"), col("col2")).alias("pow")).show()
## +----+----+----+
## |col1|col2| pow|
## +----+----+----+
## |   1|   2| 1.0|
## |   2|   3| 8.0|
## |   3|   3|27.0|
## +----+----+----+
If you use an older version, a Python UDF should do the trick:
import math
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
my_pow = udf(lambda x, y: math.pow(x, y), DoubleType())
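For completeness, the UDF would be applied the same way as the built-in (a usage sketch, reusing the df and the col import from the snippet above):
df.select("*", my_pow(col("col1"), col("col2")).alias("pow")).show()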
Just to complement the accepted answer: one can now do something very similar to what the OP tried to do, i.e., use the ** operator, or even Python's built-in pow:
from pyspark.sql import SparkSession
from pyspark.sql.functions import pow as pow_
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, ), (2, ), (3, ), (4, ), (5, ), (6, )], 'n: int')
df = df.withColumn('pyspark_pow', pow_(df['n'], df['n'])) \
       .withColumn('python_pow', pow(df['n'], df['n'])) \
       .withColumn('double_star_operator', df['n'] ** df['n'])
df.show()
+---+-----------+----------+--------------------+
|  n|pyspark_pow|python_pow|double_star_operator|
+---+-----------+----------+--------------------+
|  1|        1.0|       1.0|                 1.0|
|  2|        4.0|       4.0|                 4.0|
|  3|       27.0|      27.0|                27.0|
|  4|      256.0|     256.0|               256.0|
|  5|     3125.0|    3125.0|              3125.0|
|  6|    46656.0|   46656.0|             46656.0|
+---+-----------+----------+--------------------+
As one can see, PySpark's pow, Python's pow, and the ** operator all return the same result. It also works when one of the arguments is a scalar:
df = df.withColumn('pyspark_pow', pow_(2, df['n'])) \
       .withColumn('python_pow', pow(2, df['n'])) \
       .withColumn('double_star_operator', 2 ** df['n'])
df.show()
+---+-----------+----------+--------------------+
|  n|pyspark_pow|python_pow|double_star_operator|
+---+-----------+----------+--------------------+
|  1|        2.0|       2.0|                 2.0|
|  2|        4.0|       4.0|                 4.0|
|  3|        8.0|       8.0|                 8.0|
|  4|       16.0|      16.0|                16.0|
|  5|       32.0|      32.0|                32.0|
|  6|       64.0|      64.0|                64.0|
+---+-----------+----------+--------------------+
I believe the reason Python's pow now works on PySpark columns is that pow is equivalent to the ** operator when used with only two arguments (see the docs here), and the ** operator uses the object's own implementation of the power operation, if it is defined for the object being operated on (see this SO response here).
Apparently, PySpark's Column has the proper definition of the __pow__ operator (see the source code for Column).
I am not sure why the ** operator did not work originally, but I assume it is related to the fact that, at the time, Column was defined differently.
The stack used for testing was Python 3.8.5 and PySpark 3.1.1, but I have seen this behavior for PySpark >= 2.4 as well.
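If you want to verify this on your own setup, a quick (hypothetical) check is to look for the dunder methods on Column directly:
from pyspark.sql import Column

# On recent PySpark versions both should print True.
print(hasattr(Column, "__pow__"), hasattr(Column, "__rpow__"))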

Replace empty strings with None/null values in DataFrame

I have a Spark 1.5.0 DataFrame with a mix of null and empty strings in the same column. I want to convert all empty strings in all columns to null (None, in Python). The DataFrame may have hundreds of columns, so I'm trying to avoid hard-coded manipulations of each column.
See my attempt below, which results in an error.
from pyspark.sql import Row, SQLContext
sqlContext = SQLContext(sc)
## Create a test DataFrame
testDF = sqlContext.createDataFrame([Row(col1='foo', col2=1), Row(col1='', col2=2), Row(col1=None, col2='')])
testDF.show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo|   1|
## |    |   2|
## |null|null|
## +----+----+
## Try to replace an empty string with None/null
testDF.replace('', None).show()
## ValueError: value should be a float, int, long, string, list, or tuple
## A string value of null (obviously) doesn't work...
testDF.replace('', 'null').na.drop(subset='col1').show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo|   1|
## |null|   2|
## +----+----+
It is as simple as this:
from pyspark.sql.functions import col, when
def blank_as_null(x):
    return when(col(x) != "", col(x)).otherwise(None)
dfWithEmptyReplaced = testDF.withColumn("col1", blank_as_null("col1"))
dfWithEmptyReplaced.show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo|   1|
## |null|   2|
## |null|null|
## +----+----+
dfWithEmptyReplaced.na.drop().show()
## +----+----+
## |col1|col2|
## +----+----+
## | foo|   1|
## +----+----+
If you want to fill multiple columns you can, for example, reduce:
from functools import reduce  # on Python 3, reduce lives in functools

to_convert = set([...]) # Some set of columns
reduce(lambda df, x: df.withColumn(x, blank_as_null(x)), to_convert, testDF)
or use a comprehension:
exprs = [
    blank_as_null(x).alias(x) if x in to_convert else x for x in testDF.columns
]
testDF.select(*exprs)
If you want to specifically operate on string fields, please check the answer by robin-loxley.
UDFs are not terribly efficient. The correct way to do this using a built-in method is:
df = df.withColumn('myCol', when(col('myCol') == '', None).otherwise(col('myCol')))
Simply adding on top of zero323's and soulmachine's answers, to convert all StringType fields:
from pyspark.sql.types import StringType

string_fields = []
for i, f in enumerate(test_df.schema.fields):
    if isinstance(f.dataType, StringType):
        string_fields.append(f.name)
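The collected names can then be fed into the select-comprehension pattern from the accepted answer (a sketch combining the two snippets; blank_as_null is the function defined earlier):
exprs = [
    blank_as_null(f).alias(f) if f in string_fields else f
    for f in test_df.columns
]
test_df.select(*exprs).show()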
My solution is better than all the solutions I've seen so far; it can deal with as many fields as you want. See the little function below:
// Replace empty Strings with null values
private def setEmptyToNull(df: DataFrame): DataFrame = {
  val exprs = df.schema.map { f =>
    f.dataType match {
      case StringType => when(length(col(f.name)) === 0, lit(null: String).cast(StringType)).otherwise(col(f.name)).as(f.name)
      case _ => col(f.name)
    }
  }

  df.select(exprs: _*)
}
You can easily rewrite the function above in Python.
I learned this trick from @liancheng
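A possible Python translation of the function above (a sketch, not part of the original answer):
from pyspark.sql.functions import col, length, lit, when
from pyspark.sql.types import StringType

def set_empty_to_null(df):
    # Rewrite only the StringType columns; pass every other column through unchanged.
    exprs = [
        when(length(col(f.name)) == 0, lit(None).cast(StringType())).otherwise(col(f.name)).alias(f.name)
        if isinstance(f.dataType, StringType)
        else col(f.name)
        for f in df.schema.fields
    ]
    return df.select(*exprs)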
If you are using Python, you can check the following.
+----+-----+----+
|  id| name| age|
+----+-----+----+
|null|name1|  50|
|   2|     |    |
|    |name3|null|
+----+-----+----+
from pyspark.sql.functions import col, when

def convertToNull(dfa):
    for i in dfa.columns:
        dfa = dfa.withColumn(i, when(col(i) == '', None).otherwise(col(i)))
    return dfa

convertToNull(dfa).show()
+----+-----+----+
|  id| name| age|
+----+-----+----+
|null|name1|  50|
|   2| null|null|
|null|name3|null|
+----+-----+----+
I would add a trim to @zero323's solution to deal with cases of multiple white spaces:
def blank_as_null(x):
    return when(trim(col(x)) != "", col(x))
Thanks to @zero323, @Tomerikoo and @Robin Loxley.
Ready-to-use function:
def convert_blank_to_null(df, cols=None):
    from pyspark.sql.functions import col, when, trim
    from pyspark.sql.types import StringType

    def blank_as_null(x):
        return when(trim(col(x)) == "", None).otherwise(col(x))

    # Don't know how to parallelize this
    for f in (df.select(cols) if cols else df).schema.fields:
        if isinstance(f.dataType, StringType):
            df = df.withColumn(f.name, blank_as_null(f.name))

    return df
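Hypothetical usage with the testDF from the question (the cols argument restricts the conversion to a subset of columns):
convert_blank_to_null(testDF).show()
convert_blank_to_null(testDF, cols=["col1"]).show()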
This is a different version of soulmachine's solution, but I don't think you can translate this to Python as easily:
def emptyStringsToNone(df: DataFrame): DataFrame = {
  df.schema.foldLeft(df)(
    (current, field) =>
      field.dataType match {
        case DataTypes.StringType =>
          current.withColumn(
            field.name,
            when(length(col(field.name)) === 0, lit(null: String)).otherwise(col(field.name))
          )
        case _ => current
      }
  )
}
