I have a data frame df with columns "col1" and "col2". I want to create a third column that uses one of the columns in an exponentiation:
df = df.withColumn("col3", 100**(df("col1")))*df("col2")
However, this always results in:
TypeError: unsupported operand type(s) for ** or pow(): 'float' and 'Column'
I understand that this is due to the function taking df("col1") as a "Column" instead of the item at that row.
If I perform
results = df.map(lambda x : 100**(df("col2"))*df("col2"))
this works, but I can't append to my original data frame.
Any thoughts?
This is my first time posting, so I apologize for any formatting problems.
Since Spark 1.4 you can use the pow function as follows:
from pyspark.sql import Row
from pyspark.sql.functions import pow, col
row = Row("col1", "col2")
df = sc.parallelize([row(1, 2), row(2, 3), row(3, 3)]).toDF()
df.select("*", pow(col("col1"), col("col2")).alias("pow")).show()
## +----+----+----+
## |col1|col2| pow|
## +----+----+----+
## | 1| 2| 1.0|
## | 2| 3| 8.0|
## | 3| 3|27.0|
## +----+----+----+
If you use an older version, a Python UDF should do the trick:
import math
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
my_pow = udf(lambda x, y: math.pow(x, y), DoubleType())
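For completeness, a minimal usage sketch of that UDF with the df from the example above (same column names assumed); it should produce the same values as the pow example above:

df.select("*", my_pow(col("col1"), col("col2")).alias("pow")).show()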
Just to complement the accepted answer: one can now do something very similar to what the OP tried to do, i.e., use the ** operator, or even Python's built-in pow:
from pyspark.sql import SparkSession
from pyspark.sql.functions import pow as pow_
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, ), (2, ), (3, ), (4, ), (5, ), (6, )], 'n: int')
df = df.withColumn('pyspark_pow', pow_(df['n'], df['n'])) \
.withColumn('python_pow', pow(df['n'], df['n'])) \
.withColumn('double_star_operator', df['n'] ** df['n'])
df.show()
+---+-----------+----------+--------------------+
| n|pyspark_pow|python_pow|double_star_operator|
+---+-----------+----------+--------------------+
| 1| 1.0| 1.0| 1.0|
| 2| 4.0| 4.0| 4.0|
| 3| 27.0| 27.0| 27.0|
| 4| 256.0| 256.0| 256.0|
| 5| 3125.0| 3125.0| 3125.0|
| 6| 46656.0| 46656.0| 46656.0|
+---+-----------+----------+--------------------+
As one can see, both PySpark's and Python's pow return the same result, as well as the ** operator. It also works when one of the arguments is a scalar:
df = df.withColumn('pyspark_pow', pow_(2, df['n'])) \
.withColumn('python_pow', pow(2, df['n'])) \
.withColumn('double_star_operator', 2 ** df['n'])
df.show()
+---+-----------+----------+--------------------+
| n|pyspark_pow|python_pow|double_star_operator|
+---+-----------+----------+--------------------+
| 1| 2.0| 2.0| 2.0|
| 2| 4.0| 4.0| 4.0|
| 3| 8.0| 8.0| 8.0|
| 4| 16.0| 16.0| 16.0|
| 5| 32.0| 32.0| 32.0|
| 6| 64.0| 64.0| 64.0|
+---+-----------+----------+--------------------+
I believe the reason Python's pow now works on PySpark columns is that pow is equivalent to the ** operator when used with only two arguments (see the docs here), and the ** operator uses the object's own implementation of the power operation, if it is defined for the object being operated on (see this SO answer here).
Apparently, PySpark's Column has the proper definitions for the __pow__ (and __rpow__) operators (see the source code for Column).
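As a small illustration of that delegation (just a sketch, reusing the df with column 'n' from above): both operator orderings simply build Column expressions rather than computing numbers.

from pyspark.sql import Column

c = df['n']
print(isinstance(c ** 2, Column))   # True: goes through Column.__pow__
print(isinstance(2 ** c, Column))   # True: goes through Column.__rpow__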
I am not sure why the ** operator did not work originally, but I am assuming it is related to the fact that, at the time, Column was defined differently.
The stack used for testing was Python 3.8.5 and PySpark 3.1.1, but I have seen this behavior for PySpark >= 2.4 as well.
df = spark.createDataFrame(
    [
        (1, "AxtTR"),   # create your data here, be consistent in the types
        (2, "HdyOP"),
        (3, "EqoPIC"),
        (4, "OkTEic"),
    ],
    ["id", "label"],  # add your column names here
)
df.show()
The code below is in Python, where I use the apply function to extract the first 2 letters of every row. I want to replicate the same logic in PySpark, where a function is applied to every single row to get the output.
def get_string(lst):
    lst = str(lst)
    lst = lst.lower()
    lst = lst[0:2]
    return lst

df['first_2letter'] = df['label'].apply(get_string)
The expected output (highlighted in yellow in the image from the original post) is a new column containing the lowercase first two letters of each label.
You can use the relevant Spark SQL functions:
import pyspark.sql.functions as F
df2 = df.withColumn('first_2letter', F.lower('label')[0:2])
df2.show()
+---+------+-------------+
| id| label|first_2letter|
+---+------+-------------+
| 1| AxtTR| ax|
| 2| HdyOP| hd|
| 3|EqoPIC| eq|
| 4|OkTEic| ok|
+---+------+-------------+
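If the slice syntax on a Column feels unfamiliar, an equivalent formulation is a plain substring call (a sketch; note that substring uses a 1-based start position):

import pyspark.sql.functions as F

df2 = df.withColumn('first_2letter', F.substring(F.lower('label'), 1, 2))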
If you want to use user-defined functions, you can define them as:
def get_string(lst):
    lst = str(lst)
    lst = lst.lower()
    lst = lst[0:2]
    return lst
import pyspark.sql.functions as F
df2 = df.withColumn('first_2letter', F.udf(get_string)('label'))
df2.show()
+---+------+-------------+
| id| label|first_2letter|
+---+------+-------------+
| 1| AxtTR| ax|
| 2| HdyOP| hd|
| 3|EqoPIC| eq|
| 4|OkTEic| ok|
+---+------+-------------+
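A small side note on the UDF approach (optional, just a sketch): F.udf defaults to a StringType return type when none is given, so you can also spell the type out explicitly.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

get_string_udf = udf(get_string, StringType())  # explicit return type
df2 = df.withColumn('first_2letter', get_string_udf('label'))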
I am trying to compute an exponentially weighted moving average in PySpark using a grouped map Pandas UDF. It doesn't work though:
def ExpMA(myData):
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.functions import PandasUDFType
    from pyspark.sql import SQLContext

    df = myData
    group_col = 'Name'
    sort_col = 'Date'

    schema = df.select(group_col, sort_col, 'count').schema
    print(schema)

    @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
    def ema(pdf):
        Model = pd.DataFrame(pdf.apply(lambda x: x['count'].ewm(span=5, min_periods=1).mean()))
        return Model

    data = df.groupby('Name').apply(ema)
    return data
I also tried running it without the Pandas UDF, just writing the ewma formula directly in PySpark, but the problem there is that the formula is recursive: each value depends on the lagged ewma value.
First of all, your Pandas code is incorrect. This just won't work, Spark or not:
pdf.apply(lambda x: x['count'].ewm(span=5, min_periods=1).mean())
Another problem is the output schema, which, depending on your data, won't really accommodate the result:
If you want to add the ewm column, the schema should be extended.
If you want to return only the ewm column, the schema is too large.
If you want to just replace an existing column, the type might not match.
Let's assume this is the first scenario (I allowed myself to rewrite your code a bit):
from pyspark.sql.functions import pandas_udf
from pyspark.sql.functions import PandasUDFType
from pyspark.sql.types import DoubleType, StructField
def exp_ma(df, group_col='Name', sort_col='Date'):
    schema = (df.select(group_col, sort_col, 'count')
                .schema.add(StructField('ewma', DoubleType())))

    @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
    def ema(pdf):
        pdf['ewma'] = pdf['count'].ewm(span=5, min_periods=1).mean()
        return pdf

    return df.groupby('Name').apply(ema)
df = spark.createDataFrame(
[("a", 1, 1), ("a", 2, 3), ("a", 3, 3), ("b", 1, 10), ("b", 8, 3), ("b", 9, 0)],
("name", "date", "count")
)
exp_ma(df).show()
# +----+----+-----+------------------+
# |Name|Date|count| ewma|
# +----+----+-----+------------------+
# | b| 1| 10| 10.0|
# | b| 8| 3| 5.800000000000001|
# | b| 9| 0|3.0526315789473686|
# | a| 1| 1| 1.0|
# | a| 2| 3| 2.2|
# | a| 3| 3| 2.578947368421052|
# +----+----+-----+------------------+
I don't use Pandas much, so there might be a more elegant way of doing this.
I have two quick rookie questions on (Py)Spark. I have a DataFrame as below, and I want to calculate the likelihood of the 'reading' column using scipy's multivariate_normal.pdf():
rdd_dat = spark.sparkContext.parallelize([(0, .12, "a"),(1, .45, "b"),(2, 1.01, "c"),(3, 1.2, "a"),
(4, .76, "a"),(5, .81, "c"),(6, 1.5, "b")])
df = rdd_dat.toDF(["id", "reading", "category"])
df.show()
+---+-------+--------+
| id|reading|category|
+---+-------+--------+
| 0| 0.12| a|
| 1| 0.45| b|
| 2| 1.01| c|
| 3| 1.2| a|
| 4| 0.76| a|
| 5| 0.81| c|
| 6| 1.5| b|
+---+-------+--------+
This is my attempt using the UserDefinedFunction:
from scipy.stats import multivariate_normal
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import DoubleType
mle = UserDefinedFunction(multivariate_normal.pdf, DoubleType())
mean =1
cov=1
df_with_mle = df.withColumn("MLE", mle(df['reading']))
This runs without throwing an error, but when I want to look at the resulting df_with_mle, I get the error below:
df_with_mle.show()
An error occurred while calling o149.showString.
1) Any idea why I am getting this error?
2) If I wanted to specify the mean and cov, like: df.withColumn("MLE", mle(df['reading'], 1, 1)), how I can I do this?
The multivariate_normal.pdf() method from scipy expects to receive a Series. A column from a pandas DataFrame is a Series, but a column from a PySpark DataFrame is a different kind of object (a pyspark.sql.column.Column), which scipy doesn't know how to handle.
Also (and this won't keep your function call from running), your call doesn't specify the parameters: cov and mean aren't picked up by the API unless they occur within the method call. mean and cov are just integer variables until you pass them as parameters and override the defaults (mean=0, cov=1, per the scipy documentation):
multivariate_normal.pdf(x=df['reading'], mean=mean, cov=cov)
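To make that concrete, here is a sketch of one way to handle both issues: wrap scipy's pdf in a UDF so that each 'reading' value reaches scipy as a plain float, with mean and cov closed over (mean=1 and cov=1 are taken from the question as an assumption).

from scipy.stats import multivariate_normal
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

mean, cov = 1, 1
mle = udf(lambda x: float(multivariate_normal.pdf(x, mean=mean, cov=cov)), DoubleType())

df_with_mle = df.withColumn("MLE", mle(df['reading']))
df_with_mle.show()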
I know we can use window functions in PySpark to calculate a cumulative sum. But window functions are only supported in HiveContext and not in SQLContext. I need to use SQLContext, as HiveContext cannot be run in multiple processes.
Is there any efficient way to calculate a cumulative sum using SQLContext? A simple way is to load the data into the driver's memory and use numpy.cumsum, but the downside is that the data needs to fit into memory.
Not sure if this is what you are looking for, but here are two examples of how to use sqlContext to calculate the cumulative sum.
First, when you want to partition it by some categories:
from pyspark.sql.types import StructType, StructField, StringType, LongType
from pyspark.sql import SQLContext
rdd = sc.parallelize([
("Tablet", 6500),
("Tablet", 5500),
("Cell Phone", 6000),
("Cell Phone", 6500),
("Cell Phone", 5500)
])
schema = StructType([
StructField("category", StringType(), False),
StructField("revenue", LongType(), False)
])
df = sqlContext.createDataFrame(rdd, schema)
df.registerTempTable("test_table")
df2 = sqlContext.sql("""
SELECT
category,
revenue,
sum(revenue) OVER (PARTITION BY category ORDER BY revenue) as cumsum
FROM
test_table
""")
Output:
[Row(category='Tablet', revenue=5500, cumsum=5500),
Row(category='Tablet', revenue=6500, cumsum=12000),
Row(category='Cell Phone', revenue=5500, cumsum=5500),
Row(category='Cell Phone', revenue=6000, cumsum=11500),
Row(category='Cell Phone', revenue=6500, cumsum=18000)]
Second, when you only want to take the cumsum of one variable, change df2 to this:
df2 = sqlContext.sql("""
SELECT
category,
revenue,
sum(revenue) OVER (ORDER BY revenue, category) as cumsum
FROM
test_table
""")
Output:
[Row(category='Cell Phone', revenue=5500, cumsum=5500),
Row(category='Tablet', revenue=5500, cumsum=11000),
Row(category='Cell Phone', revenue=6000, cumsum=17000),
Row(category='Cell Phone', revenue=6500, cumsum=23500),
Row(category='Tablet', revenue=6500, cumsum=30000)]
Hope this helps. Collecting the data and using np.cumsum is not very efficient, especially if the dataset is large. Another way you could explore is to use simple RDD transformations like groupByKey(), then map over each group to calculate its cumulative sum, and combine the results at the end; a rough sketch is shown below.
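A rough sketch of that RDD-based idea, reusing the (category, revenue) rdd defined earlier (the ordering by revenue is an assumption; any deterministic ordering works):

from itertools import accumulate

def group_cumsum(kv):
    category, revenues = kv
    ordered = sorted(revenues)                      # order within each group
    return [(category, r, c) for r, c in zip(ordered, accumulate(ordered))]

cumsums = rdd.groupByKey().flatMap(group_cumsum)    # one (category, revenue, cumsum) tuple per row
cumsums.collect()
# e.g. [('Tablet', 5500, 5500), ('Tablet', 6500, 12000), ...]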
Here is a simple example using a window function:
import pyspark
from pyspark.sql import window
import pyspark.sql.functions as sf
sc = pyspark.SparkContext(appName="test")
sqlcontext = pyspark.SQLContext(sc)
data = sqlcontext.createDataFrame([("Bob", "M", "Boston", 1, 20),
("Cam", "F", "Cambridge", 1, 25),
("Lin", "F", "Cambridge", 1, 25),
("Cat", "M", "Boston", 1, 20),
("Sara", "F", "Cambridge", 1, 15),
("Jeff", "M", "Cambridge", 1, 25),
("Bean", "M", "Cambridge", 1, 26),
("Dave", "M", "Cambridge", 1, 21),],
["name", 'gender', "city", 'donation', "age"])
data.show()
gives output
+----+------+---------+--------+---+
|name|gender| city|donation|age|
+----+------+---------+--------+---+
| Bob| M| Boston| 1| 20|
| Cam| F|Cambridge| 1| 25|
| Lin| F|Cambridge| 1| 25|
| Cat| M| Boston| 1| 20|
|Sara| F|Cambridge| 1| 15|
|Jeff| M|Cambridge| 1| 25|
|Bean| M|Cambridge| 1| 26|
|Dave| M|Cambridge| 1| 21|
+----+------+---------+--------+---+
Define a window
win_spec = (window.Window
            .partitionBy(['gender', 'city'])
            .rowsBetween(window.Window.unboundedPreceding, 0))
# window.Window.unboundedPreceding -- first row of the group
# .rowsBetween(..., 0)             -- 0 refers to the current row; if -2 were specified instead,
#                                     the frame would start 2 rows before the current row
Now, here is a trap:
temp = data.withColumn('cumsum',sum(data.donation).over(win_spec))
which fails with this error:
TypeErrorTraceback (most recent call last)
<ipython-input-9-b467d24b05cd> in <module>()
----> 1 temp = data.withColumn('cumsum',sum(data.donation).over(win_spec))
/Users/mupadhye/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/column.pyc in __iter__(self)
238
239 def __iter__(self):
--> 240 raise TypeError("Column is not iterable")
241
242 # string methods
TypeError: Column is not iterable
This is caused by using Python's built-in sum function instead of PySpark's. The way to fix this is to use the sum function from pyspark.sql.functions (imported above as sf):
temp = data.withColumn('CumSumDonation', sf.sum(data.donation).over(win_spec))
temp.show()
will give:
+----+------+---------+--------+---+--------------+
|name|gender| city|donation|age|CumSumDonation|
+----+------+---------+--------+---+--------------+
|Sara| F|Cambridge| 1| 15| 1|
| Cam| F|Cambridge| 1| 25| 2|
| Lin| F|Cambridge| 1| 25| 3|
| Bob| M| Boston| 1| 20| 1|
| Cat| M| Boston| 1| 20| 2|
|Dave| M|Cambridge| 1| 21| 1|
|Jeff| M|Cambridge| 1| 25| 2|
|Bean| M|Cambridge| 1| 26| 3|
+----+------+---------+--------+---+--------------+
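One caveat from my side (not part of the original answer): without an explicit ordering, the row order inside each partition is not guaranteed, so for a deterministic running sum it is safer to order the window; ordering by age here is just an example assumption.

win_spec_ordered = (window.Window
                    .partitionBy(['gender', 'city'])
                    .orderBy('age')
                    .rowsBetween(window.Window.unboundedPreceding, 0))

temp = data.withColumn('CumSumDonation', sf.sum(data.donation).over(win_spec_ordered))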
After landing on this thread while trying to solve a similar problem, I solved my issue using this code. Not sure if I'm missing part of the OP's question, but this is a way to sum a column using SQLContext:
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql.context import SQLContext

conf = SparkConf()
conf.setAppName('Sum SQLContext Column')
conf.set("spark.executor.memory", "2g")

sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
sqlContext = SQLContext(sc)

def sum_column(table, column):
    sc_table = sqlContext.table(table)
    return sc_table.agg({column: "sum"})

sum_column("db.tablename", "column").show()
It is not true that window functions work only with HiveContext. You can use them even with sqlContext:
from pyspark.sql.window import Window
from pyspark.sql.functions import sum as sum_

myPartition = Window.partitionBy(['col1', 'col2', 'col3'])
temp = temp.withColumn("#dummy", sum_(temp.col4).over(myPartition))
It seems like there is no support for replacing infinity values. I tried the code below and it doesn't work. Or am I missing something?
a=sqlContext.createDataFrame([(None, None), (1, np.inf), (None, 2)])
a.replace(np.inf, 10)
Or do I have to take the painful route: convert the PySpark DataFrame into a pandas DataFrame, replace the infinity values, and convert it back to a PySpark DataFrame?
It seems like there is no support for replacing infinity values.
Actually, it looks like a Py4J bug, not an issue with replace itself. See Support nan/inf between Python and Java.
As a workaround, you can try either a UDF (slow option):
import numpy as np

from pyspark.sql.types import DoubleType
from pyspark.sql.functions import col, lit, udf, when

df = sc.parallelize([(None, None), (1.0, np.inf), (None, 2.0)]).toDF(["x", "y"])
replace_infs_udf = udf(
lambda x, v: float(v) if x and np.isinf(x) else x, DoubleType()
)
df.withColumn("x1", replace_infs_udf(col("y"), lit(-99.0))).show()
## +----+--------+-----+
## | x| y| x1|
## +----+--------+-----+
## |null| null| null|
## | 1.0|Infinity|-99.0|
## |null| 2.0| 2.0|
## +----+--------+-----+
or an expression like this:
def replace_infs(c, v):
    is_infinite = c.isin([
        lit("+Infinity").cast("double"),
        lit("-Infinity").cast("double")
    ])
    return when(c.isNotNull() & is_infinite, v).otherwise(c)
df.withColumn("x1", replace_infs(col("y"), lit(-99))).show()
## +----+--------+-----+
## | x| y| x1|
## +----+--------+-----+
## |null| null| null|
## | 1.0|Infinity|-99.0|
## |null| 2.0| 2.0|
## +----+--------+-----+