I am working with a spark DataFrame where I would like to detect any value from a specific column where the value does not monotonically decrease. For those values, I would like to replace them with the previous value according to the ordering criteria.
Here is a conceptual example, if I have a column of value [65, 66, 62, 100, 40]. the value "100" is not following the monotonic decrease trend and therefore should be replaced by 62. So the resulting list will be [65, 66, 62, 62, 40].
Below is some code that I created to detect the value that must be replaced however I don't know how to replace the value by the previous and also how to ignore the initial null value from the lag.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as psf
from pyspark.sql.window import Window
sc = SparkContext(appName="sample-app")
sqlc = SQLContext(sc)
rdd = sc.parallelize([(1, 65), (2, 66), (3, 62), (4, 100), (5, 40)])
df = sqlc.createDataFrame(rdd, ["id", "value"])
window = Window.orderBy(df.id).rowsBetween(-1, -1)
sdf = df.withColumn(
"__monotonic_col",
(df.value <= psf.lag(df.value, 1).over(window)) & df.value.isNotNull(),
)
sdf.show()
This code produce the following output:
+---+-----+---------------+
| id|value|__monotonic_col|
+---+-----+---------------+
| 1| 65| null|
| 2| 66| false|
| 3| 62| true|
| 4| 100| false|
| 5| 40| true|
+---+-----+---------------+
Firstly, if my understanding is correct, shouldn't the 66 also be replaced (by 65) as it does not follow the decreasing trend?
If that is the correct interpretation, then the following should work (I have added an extra column to keep things tidy, but you could wrap everything into a single column creation statement):
from pyspark.sql import functions as F
sdf = sdf.withColumn(
"__monotonic_col_value",
F.when(
F.col("__monotonic_col") | F.col("__monotonic_col").isNull(), df.value)
.otherwise(
F.lag(df.value, 1).over(window)
),
)
Related
I am trying to run exponential weighted moving average in PySpark using a Grouped Map Pandas UDF. It doesn't work though:
def ExpMA(myData):
from pyspark.sql.functions import pandas_udf
from pyspark.sql.functions import PandasUDFType
from pyspark.sql import SQLContext
df = myData
group_col = 'Name'
sort_col = 'Date'
schema = df.select(group_col, sort_col,'count').schema
print(schema)
#pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def ema(pdf):
Model = pd.DataFrame(pdf.apply(lambda x: x['count'].ewm(span=5, min_periods=1).mean()))
return Model
data = df.groupby('Name').apply(ema)
return data
I also tried running it without the Pandas udf, just writing the ewma equation in PySpark, but the problem there is that the ewma equation contains the lag of the current ewma.
First of all your Pandas code is incorrect. This just won't work, Spark or not
pdf.apply(lambda x: x['count'].ewm(span=5, min_periods=1).mean())
Another problem is the output schema, which depending on your data, won't really accommodate the result:
If want to add ewm schema should be extended.
If you want to return only ewm then schema is to large.
If you want to just replace, it might not match the type.
Let's assume this is the first scenario (I allowed myself to rewrite your code a bit):
from pyspark.sql.functions import pandas_udf
from pyspark.sql.functions import PandasUDFType
from pyspark.sql.types import DoubleType, StructField
def exp_ma(df, group_col='Name', sort_col='Date'):
schema = (df.select(group_col, sort_col, 'count')
.schema.add(StructField('ewma', DoubleType())))
#pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def ema(pdf):
pdf['ewm'] = pdf['count'].ewm(span=5, min_periods=1).mean()
return pdf
return df.groupby('Name').apply(ema)
df = spark.createDataFrame(
[("a", 1, 1), ("a", 2, 3), ("a", 3, 3), ("b", 1, 10), ("b", 8, 3), ("b", 9, 0)],
("name", "date", "count")
)
exp_ma(df).show()
# +----+----+-----+------------------+
# |Name|Date|count| ewma|
# +----+----+-----+------------------+
# | b| 1| 10| 10.0|
# | b| 8| 3| 5.800000000000001|
# | b| 9| 0|3.0526315789473686|
# | a| 1| 1| 1.0|
# | a| 2| 3| 2.2|
# | a| 3| 3| 2.578947368421052|
# +----+----+-----+------------------+
I don't use much Pandas so there might be more elegant way of doing this.
Given a pyspark dataframe df with columns 'ProductId', 'Date' and 'Price', how safe is to sort by 'Date' and assume that func.first('Price') will always retrieve the Price corresponding to the minimum date?
I mean: will
df.orderBy('ProductId', 'Date').groupBy('ProductId').agg(func.first('Price'))
return for each product the first price paid in time without messing with the orderBy while grouping?
I am not sure if the order is guaranteed to be maintained for the groupBy(). However, here is alternative way to do what you want that will work.
Use pyspark.sql.Window to partition and order the DataFrame as desired. Then use pyspark.sql.DataFrame.distinct() to drop the duplicate entries.
For example:
Create Dummy Data
data = [
(123, '2017-07-01', 50),
(123, '2017-01-01', 100),
(345, '2018-01-01', 20),
(123, '2017-03-01', 25),
(345, '2018-02-01', 33)
]
df = sqlCtx.createDataFrame(data, ['ProductId', 'Date', 'Price'])
df.show()
#+---------+----------+-----+
#|ProductId| Date|Price|
#+---------+----------+-----+
#| 123|2017-07-01| 50|
#| 123|2017-01-01| 100|
#| 345|2018-01-01| 20|
#| 123|2017-03-01| 25|
#| 345|2018-02-01| 33|
#+---------+----------+-----+
Use Window
Use Window.partitionBy('ProductId').orderBy('Date'):
import pyspark.sql.functions as f
from pyspark.sql import Window
df.select(
'ProductId',
f.first('Price').over(Window.partitionBy('ProductId').orderBy('Date')).alias('Price')
).distinct().show()
#+---------+-----+
#|ProductId|Price|
#+---------+-----+
#| 123| 100|
#| 345| 20|
#+---------+-----+
Edit
I found this scala post in which the accepted answer says that the order is preserved, though there is a discussion in the comments that contradicts that.
I know we can use Window function in pyspark to calculate cumulative sum. But Window is only supported in HiveContext and not in SQLContext. I need to use SQLContext as HiveContext cannot be run in multi processes.
Is there any efficient way to calculate cumulative sum using SQLContext? A simple way is to load the data into the driver's memory and use numpy.cumsum, but the con is the data need to be able to fit into the memory
Not sure if this is what you are looking for but here are two examples how to use sqlContext to calculate the cumulative sum:
First when you want to partition it by some categories:
from pyspark.sql.types import StructType, StringType, LongType
from pyspark.sql import SQLContext
rdd = sc.parallelize([
("Tablet", 6500),
("Tablet", 5500),
("Cell Phone", 6000),
("Cell Phone", 6500),
("Cell Phone", 5500)
])
schema = StructType([
StructField("category", StringType(), False),
StructField("revenue", LongType(), False)
])
df = sqlContext.createDataFrame(rdd, schema)
df.registerTempTable("test_table")
df2 = sqlContext.sql("""
SELECT
category,
revenue,
sum(revenue) OVER (PARTITION BY category ORDER BY revenue) as cumsum
FROM
test_table
""")
Output:
[Row(category='Tablet', revenue=5500, cumsum=5500),
Row(category='Tablet', revenue=6500, cumsum=12000),
Row(category='Cell Phone', revenue=5500, cumsum=5500),
Row(category='Cell Phone', revenue=6000, cumsum=11500),
Row(category='Cell Phone', revenue=6500, cumsum=18000)]
Second when you only want to take the cumsum of one variable. Change df2 to this:
df2 = sqlContext.sql("""
SELECT
category,
revenue,
sum(revenue) OVER (ORDER BY revenue, category) as cumsum
FROM
test_table
""")
Output:
[Row(category='Cell Phone', revenue=5500, cumsum=5500),
Row(category='Tablet', revenue=5500, cumsum=11000),
Row(category='Cell Phone', revenue=6000, cumsum=17000),
Row(category='Cell Phone', revenue=6500, cumsum=23500),
Row(category='Tablet', revenue=6500, cumsum=30000)]
Hope this helps. Using np.cumsum is not very efficient after collecting the data especially if the dataset is large. Another way you could explore is to use simple RDD transformations like groupByKey() and then use map to calculate the cumulative sum of each group by some key and then reduce it at the end.
Here is a simple example:
import pyspark
from pyspark.sql import window
import pyspark.sql.functions as sf
sc = pyspark.SparkContext(appName="test")
sqlcontext = pyspark.SQLContext(sc)
data = sqlcontext.createDataFrame([("Bob", "M", "Boston", 1, 20),
("Cam", "F", "Cambridge", 1, 25),
("Lin", "F", "Cambridge", 1, 25),
("Cat", "M", "Boston", 1, 20),
("Sara", "F", "Cambridge", 1, 15),
("Jeff", "M", "Cambridge", 1, 25),
("Bean", "M", "Cambridge", 1, 26),
("Dave", "M", "Cambridge", 1, 21),],
["name", 'gender', "city", 'donation', "age"])
data.show()
gives output
+----+------+---------+--------+---+
|name|gender| city|donation|age|
+----+------+---------+--------+---+
| Bob| M| Boston| 1| 20|
| Cam| F|Cambridge| 1| 25|
| Lin| F|Cambridge| 1| 25|
| Cat| M| Boston| 1| 20|
|Sara| F|Cambridge| 1| 15|
|Jeff| M|Cambridge| 1| 25|
|Bean| M|Cambridge| 1| 26|
|Dave| M|Cambridge| 1| 21|
+----+------+---------+--------+---+
Define a window
win_spec = (window.Window
.partitionBy(['gender', 'city'])
.rowsBetween(window.Window.unboundedPreceding, 0))
# window.Window.unboundedPreceding -- first row of the group
# .rowsBetween(..., 0) -- 0 refers to current row, if instead -2 specified then upto 2 rows before current row
Now, here is a trap:
temp = data.withColumn('cumsum',sum(data.donation).over(win_spec))
with error :
TypeErrorTraceback (most recent call last)
<ipython-input-9-b467d24b05cd> in <module>()
----> 1 temp = data.withColumn('cumsum',sum(data.donation).over(win_spec))
/Users/mupadhye/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/column.pyc in __iter__(self)
238
239 def __iter__(self):
--> 240 raise TypeError("Column is not iterable")
241
242 # string methods
TypeError: Column is not iterable
This is due to using python's sum function instead of pyspark's. The way to fix this is using sum function from pyspark.sql.functions.sum:
temp = data.withColumn('AgeSum',sf.sum(data.donation).over(win_spec))
temp.show()
will give:
+----+------+---------+--------+---+--------------+
|name|gender| city|donation|age|CumSumDonation|
+----+------+---------+--------+---+--------------+
|Sara| F|Cambridge| 1| 15| 1|
| Cam| F|Cambridge| 1| 25| 2|
| Lin| F|Cambridge| 1| 25| 3|
| Bob| M| Boston| 1| 20| 1|
| Cat| M| Boston| 1| 20| 2|
|Dave| M|Cambridge| 1| 21| 1|
|Jeff| M|Cambridge| 1| 25| 2|
|Bean| M|Cambridge| 1| 26| 3|
+----+------+---------+--------+---+--------------+
After landing on this thread trying to solve a similar problem, I've solved my issue using this code. Not sure if I'm missing part of the OP, but this is a way to sum a SQLContext column:
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql.context import SQLContext
sc = SparkContext()
sc.setLogLevel("ERROR")
conf = SparkConf()
conf.setAppName('Sum SQLContext Column')
conf.set("spark.executor.memory", "2g")
sqlContext = SQLContext(sc)
def sum_column(table, column):
sc_table = sqlContext.table(table)
return sc_table.agg({column: "sum"})
sum_column("db.tablename", "column").show()
It is not true that windows function works only with HiveContext. You can use them even with sqlContext:
from pyspark.sql.window import *
myPartition=Window.partitionBy(['col1','col2','col3'])
temp= temp.withColumn("#dummy",sum(temp.col4).over(myPartition))
It seems like there is no support for replacing infinity values. I tried the code below and it doesn't work. Or am I missing out something?
a=sqlContext.createDataFrame([(None, None), (1, np.inf), (None, 2)])
a.replace(np.inf, 10)
Or do I have to take the painful route: convert PySpark DataFrame into pandas DataFrame, replace infinity values, and convert it back to PySpark DataFrame
It seems like there is no support for replacing infinity values.
Actually it looks like a Py4J bug not an issue with replace itself. See Support nan/inf between Python and Java.
As a workaround, you can try either UDF (slow option):
from pyspark.sql.types import DoubleType
from pyspark.sql.functions import col, lit, udf, when
df = sc.parallelize([(None, None), (1.0, np.inf), (None, 2.0)]).toDF(["x", "y"])
replace_infs_udf = udf(
lambda x, v: float(v) if x and np.isinf(x) else x, DoubleType()
)
df.withColumn("x1", replace_infs_udf(col("y"), lit(-99.0))).show()
## +----+--------+-----+
## | x| y| x1|
## +----+--------+-----+
## |null| null| null|
## | 1.0|Infinity|-99.0|
## |null| 2.0| 2.0|
## +----+--------+-----+
or expression like this:
def replace_infs(c, v):
is_infinite = c.isin([
lit("+Infinity").cast("double"),
lit("-Infinity").cast("double")
])
return when(c.isNotNull() & is_infinite, v).otherwise(c)
df.withColumn("x1", replace_infs(col("y"), lit(-99))).show()
## +----+--------+-----+
## | x| y| x1|
## +----+--------+-----+
## |null| null| null|
## | 1.0|Infinity|-99.0|
## |null| 2.0| 2.0|
## +----+--------+-----+
I have a data frame df with columns "col1" and "col2". I want to create a third column which uses one of the columns as in an exponent function.
df = df.withColumn("col3", 100**(df("col1")))*df("col2")
However, this always results in:
TypeError: unsupported operand type(s) for ** or pow(): 'float' and 'Column'
I understand that this is due to the function taking df("col1") as a "Column" instead of the item at that row.
If I perform
results = df.map(lambda x : 100**(df("col2"))*df("col2"))
this works, but I can't append to my original data frame.
Any thoughts?
This is my first time posting, so I apologize for any formatting problems.
Since Spark 1.4 you can usepow function as follows:
from pyspark.sql import Row
from pyspark.sql.functions import pow, col
row = Row("col1", "col2")
df = sc.parallelize([row(1, 2), row(2, 3), row(3, 3)]).toDF()
df.select("*", pow(col("col1"), col("col2")).alias("pow")).show()
## +----+----+----+
## |col1|col2| pow|
## +----+----+----+
## | 1| 2| 1.0|
## | 2| 3| 8.0|
## | 3| 3|27.0|
## +----+----+----+
If you use an older version a Python UDF should do the trick:
import math
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
my_pow = udf(lambda x, y: math.pow(x, y), DoubleType())
Just to complement the accepted answer: one can now do something very similar to what the OP tried to do, i.e., use the ** operator, or even Python's builtin pow:
from pyspark.sql import SparkSession
from pyspark.sql.functions import pow as pow_
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, ), (2, ), (3, ), (4, ), (5, ), (6, )], 'n: int')
df = df.withColumn('pyspark_pow', pow_(df['n'], df['n'])) \
.withColumn('python_pow', pow(df['n'], df['n'])) \
.withColumn('double_star_operator', df['n'] ** df['n'])
df.show()
+---+-----------+----------+--------------------+
| n|pyspark_pow|python_pow|double_star_operator|
+---+-----------+----------+--------------------+
| 1| 1.0| 1.0| 1.0|
| 2| 4.0| 4.0| 4.0|
| 3| 27.0| 27.0| 27.0|
| 4| 256.0| 256.0| 256.0|
| 5| 3125.0| 3125.0| 3125.0|
| 6| 46656.0| 46656.0| 46656.0|
+---+-----------+----------+--------------------+
As one can see, both PySpark's and Python's pow return the same result, as well as the ** operator. It also works when one of the arguments is a scalar:
df = df.withColumn('pyspark_pow', pow_(2, df['n'])) \
.withColumn('python_pow', pow(2, df['n'])) \
.withColumn('double_star_operator', 2 ** df['n'])
df.show()
+---+-----------+----------+--------------------+
| n|pyspark_pow|python_pow|double_star_operator|
+---+-----------+----------+--------------------+
| 1| 2.0| 2.0| 2.0|
| 2| 4.0| 4.0| 4.0|
| 3| 8.0| 8.0| 8.0|
| 4| 16.0| 16.0| 16.0|
| 5| 32.0| 32.0| 32.0|
| 6| 64.0| 64.0| 64.0|
+---+-----------+----------+--------------------+
I believe the reason Python's pow now work on PySpark columns, is the fact that pow is equivalent to the ** operator when used with only two arguments (see docs, here), and the ** operator uses the objects own implementation of the power operation, if it is defined for the object being operated on (see this SO response here).
Apparently, PySpark's Column has the proper definitions for __pow__ operator (see source code for Column).
I am not sure why the ** operator did not work originally, but I am assuming it is related to the fact that - at the time - Column was defined differently.
The stack used for testing was Python 3.8.5 and PySpark 3.1.1, but I have seen this behavior for PySpark >= 2.4 as well.