How to calculate a cumulative sum using SQLContext - Python

I know we can use window functions in PySpark to calculate a cumulative sum, but window functions are only supported in HiveContext, not in SQLContext. I need to use SQLContext because HiveContext cannot be run in multiple processes.
Is there any efficient way to calculate a cumulative sum using SQLContext? A simple way is to load the data into the driver's memory and use numpy.cumsum, but the downside is that the data needs to fit into memory.
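For reference, a minimal sketch of that collect-and-cumsum approach (assuming a DataFrame df with a numeric column value; it only works while the column fits in driver memory):
import numpy as np

# Naive baseline: pull the single column to the driver and let numpy do the running sum.
values = [row.value for row in df.select("value").orderBy("value").collect()]
cumulative = np.cumsum(values)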

Not sure if this is what you are looking for, but here are two examples of how to use sqlContext to calculate the cumulative sum.
First, when you want to partition it by some categories:
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, StringType, LongType

rdd = sc.parallelize([
    ("Tablet", 6500),
    ("Tablet", 5500),
    ("Cell Phone", 6000),
    ("Cell Phone", 6500),
    ("Cell Phone", 5500)
])
schema = StructType([
    StructField("category", StringType(), False),
    StructField("revenue", LongType(), False)
])
df = sqlContext.createDataFrame(rdd, schema)
df.registerTempTable("test_table")
df2 = sqlContext.sql("""
    SELECT
        category,
        revenue,
        sum(revenue) OVER (PARTITION BY category ORDER BY revenue) AS cumsum
    FROM
        test_table
""")
Output:
[Row(category='Tablet', revenue=5500, cumsum=5500),
Row(category='Tablet', revenue=6500, cumsum=12000),
Row(category='Cell Phone', revenue=5500, cumsum=5500),
Row(category='Cell Phone', revenue=6000, cumsum=11500),
Row(category='Cell Phone', revenue=6500, cumsum=18000)]
Second, when you only want to take the cumsum of one variable, change df2 to this:
df2 = sqlContext.sql("""
    SELECT
        category,
        revenue,
        sum(revenue) OVER (ORDER BY revenue, category) AS cumsum
    FROM
        test_table
""")
Output:
[Row(category='Cell Phone', revenue=5500, cumsum=5500),
Row(category='Tablet', revenue=5500, cumsum=11000),
Row(category='Cell Phone', revenue=6000, cumsum=17000),
Row(category='Cell Phone', revenue=6500, cumsum=23500),
Row(category='Tablet', revenue=6500, cumsum=30000)]
Hope this helps. Collecting the data and then using np.cumsum is not very efficient, especially if the dataset is large. Another way you could explore is to use simple RDD transformations: groupByKey(), then map each group to its cumulative sum, combining the results at the end, as sketched below.
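For example, here is a hedged sketch of that RDD-based route, reusing the rdd of (category, revenue) pairs from above. Note that groupByKey materialises each group on an executor, so every single group still has to fit in memory there (itertools.accumulate requires Python 3):
from itertools import accumulate

def group_cumsum(kv):
    # kv is (key, iterable_of_values): sort the values and pair each one
    # with its running total from itertools.accumulate.
    key, values = kv
    vals = sorted(values)
    return [(key, v, c) for v, c in zip(vals, accumulate(vals))]

rdd.groupByKey().flatMap(group_cumsum).collect()
# e.g. [('Tablet', 5500, 5500), ('Tablet', 6500, 12000), ...]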

Here is a simple example:
import pyspark
from pyspark.sql import window
import pyspark.sql.functions as sf
sc = pyspark.SparkContext(appName="test")
sqlcontext = pyspark.SQLContext(sc)
data = sqlcontext.createDataFrame(
    [("Bob", "M", "Boston", 1, 20),
     ("Cam", "F", "Cambridge", 1, 25),
     ("Lin", "F", "Cambridge", 1, 25),
     ("Cat", "M", "Boston", 1, 20),
     ("Sara", "F", "Cambridge", 1, 15),
     ("Jeff", "M", "Cambridge", 1, 25),
     ("Bean", "M", "Cambridge", 1, 26),
     ("Dave", "M", "Cambridge", 1, 21)],
    ["name", "gender", "city", "donation", "age"])
data.show()
gives output
+----+------+---------+--------+---+
|name|gender| city|donation|age|
+----+------+---------+--------+---+
| Bob| M| Boston| 1| 20|
| Cam| F|Cambridge| 1| 25|
| Lin| F|Cambridge| 1| 25|
| Cat| M| Boston| 1| 20|
|Sara| F|Cambridge| 1| 15|
|Jeff| M|Cambridge| 1| 25|
|Bean| M|Cambridge| 1| 26|
|Dave| M|Cambridge| 1| 21|
+----+------+---------+--------+---+
Define a window
win_spec = (window.Window
            .partitionBy(['gender', 'city'])
            .rowsBetween(window.Window.unboundedPreceding, 0))
# window.Window.unboundedPreceding -- start the frame at the first row of the group
# .rowsBetween(..., 0) -- 0 refers to the current row; use -2 instead to include up to 2 rows before the current row
Now, here is a trap:
temp = data.withColumn('cumsum',sum(data.donation).over(win_spec))
which fails with the error:
TypeErrorTraceback (most recent call last)
<ipython-input-9-b467d24b05cd> in <module>()
----> 1 temp = data.withColumn('cumsum',sum(data.donation).over(win_spec))
/Users/mupadhye/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/column.pyc in __iter__(self)
238
239 def __iter__(self):
--> 240 raise TypeError("Column is not iterable")
241
242 # string methods
TypeError: Column is not iterable
This is due to using Python's built-in sum function instead of PySpark's. The fix is to use the sum function from pyspark.sql.functions:
temp = data.withColumn('CumSumDonation', sf.sum(data.donation).over(win_spec))
temp.show()
will give:
+----+------+---------+--------+---+--------------+
|name|gender| city|donation|age|CumSumDonation|
+----+------+---------+--------+---+--------------+
|Sara| F|Cambridge| 1| 15| 1|
| Cam| F|Cambridge| 1| 25| 2|
| Lin| F|Cambridge| 1| 25| 3|
| Bob| M| Boston| 1| 20| 1|
| Cat| M| Boston| 1| 20| 2|
|Dave| M|Cambridge| 1| 21| 1|
|Jeff| M|Cambridge| 1| 25| 2|
|Bean| M|Cambridge| 1| 26| 3|
+----+------+---------+--------+---+--------------+

After landing on this thread trying to solve a similar problem, I've solved my issue using this code. Not sure if I'm missing part of the OP, but this is a way to sum a SQLContext column:
from pyspark.conf import SparkConf
from pyspark.context import SparkContext
from pyspark.sql.context import SQLContext

conf = SparkConf()
conf.setAppName('Sum SQLContext Column')
conf.set("spark.executor.memory", "2g")

sc = SparkContext(conf=conf)
sc.setLogLevel("ERROR")
sqlContext = SQLContext(sc)

def sum_column(table, column):
    sc_table = sqlContext.table(table)
    return sc_table.agg({column: "sum"})

sum_column("db.tablename", "column").show()

It is not true that window functions work only with HiveContext. You can use them even with sqlContext:
from pyspark.sql.window import Window
from pyspark.sql import functions as F  # use PySpark's sum, not Python's built-in

myPartition = Window.partitionBy(['col1', 'col2', 'col3'])
temp = temp.withColumn("#dummy", F.sum(temp.col4).over(myPartition))
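Here is a self-contained sketch of the same idea with made-up data (col1 through col4 are placeholders, and an existing SparkContext sc is assumed), so you can check it against your own Spark version:
from pyspark.sql import SQLContext, functions as F
from pyspark.sql.window import Window

sqlContext = SQLContext(sc)
temp = sqlContext.createDataFrame(
    [(1, 1, 1, 10), (1, 1, 1, 20), (2, 2, 2, 30)],
    ["col1", "col2", "col3", "col4"])

# Per-group sum over a plain SQLContext window
myPartition = Window.partitionBy(['col1', 'col2', 'col3'])
temp = temp.withColumn("#dummy", F.sum(temp.col4).over(myPartition))
temp.show()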

Related

How to use Round Function with groupBy in Pyspark?

How can we use the round function with group by in PySpark? I have a Spark dataframe from which I need to generate a result using group by and the round function.
data1 = [{'Name': 'Jhon', 'ID': 21.528, 'Add': 'USA', 'ID_2': '30.90'},
         {'Name': 'Joe', 'ID': 3.69, 'Add': 'USA', 'ID_2': '12.80'},
         {'Name': 'Tina', 'ID': 2.48, 'Add': 'IND', 'ID_2': '11.07'},
         {'Name': 'Jhon', 'ID': 22.22, 'Add': 'USA', 'ID_2': '34.87'},
         {'Name': 'Joe', 'ID': 5.33, 'Add': 'INA', 'ID_2': '56.89'}]
a = sc.parallelize(data1)
In SQL the query would be something like:
select count(ID) as newid, count(ID_2) as secondaryid, round(([newid]+
[secondaryid])/[newid]* 200,1) AS [NEW_PERCENTAGE] FROM DATA1
groupby Name
You cannot use round inside a groupby; you need to create a new column afterwards:
import pyspark.sql.functions as F

df = spark.createDataFrame(a)
(df.groupby('Name')
   .agg(
       F.count('ID').alias('newid'),
       F.count('ID_2').alias('secondaryid')
   )
   .withColumn('NEW_PERCENTAGE', F.round(200 * (F.col('newid') + F.col('secondaryid')) / F.col('newid'), 1))
).show()
+----+-----+-----------+--------------+
|Name|newid|secondaryid|NEW_PERCENTAGE|
+----+-----+-----------+--------------+
| Joe| 2| 2| 400.0|
|Tina| 1| 1| 400.0|
|Jhon| 2| 2| 400.0|
+----+-----+-----------+--------------+
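For reference, here is a hedged sketch of how the SQL itself could be written so it runs: the aggregate expressions are repeated instead of referencing the newid/secondaryid aliases (which is exactly why the alias-based version fails):
# Assumes `df` is the DataFrame built from `a` above.
df.createOrReplaceTempView("data1")
spark.sql("""
    SELECT Name,
           count(ID)   AS newid,
           count(ID_2) AS secondaryid,
           round(200 * (count(ID) + count(ID_2)) / count(ID), 1) AS NEW_PERCENTAGE
    FROM data1
    GROUP BY Name
""").show()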

How to detect monotonic decrease in pyspark

I am working with a Spark DataFrame where I would like to detect any value in a specific column that does not monotonically decrease. For those values, I would like to replace them with the previous value according to the ordering criteria.
Here is a conceptual example: if I have a column of values [65, 66, 62, 100, 40], the value 100 does not follow the monotonically decreasing trend and should therefore be replaced by 62, so the resulting list would be [65, 66, 62, 62, 40].
Below is some code I created to detect the values that must be replaced; however, I don't know how to replace each one with the previous value, nor how to ignore the initial null value produced by the lag.
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql import functions as psf
from pyspark.sql.window import Window

sc = SparkContext(appName="sample-app")
sqlc = SQLContext(sc)

rdd = sc.parallelize([(1, 65), (2, 66), (3, 62), (4, 100), (5, 40)])
df = sqlc.createDataFrame(rdd, ["id", "value"])

window = Window.orderBy(df.id).rowsBetween(-1, -1)

sdf = df.withColumn(
    "__monotonic_col",
    (df.value <= psf.lag(df.value, 1).over(window)) & df.value.isNotNull(),
)

sdf.show()
This code produces the following output:
+---+-----+---------------+
| id|value|__monotonic_col|
+---+-----+---------------+
| 1| 65| null|
| 2| 66| false|
| 3| 62| true|
| 4| 100| false|
| 5| 40| true|
+---+-----+---------------+
Firstly, if my understanding is correct, shouldn't the 66 also be replaced (by 65) as it does not follow the decreasing trend?
If that is the correct interpretation, then the following should work (I have added an extra column to keep things tidy, but you could wrap everything into a single column creation statement):
from pyspark.sql import functions as F

sdf = sdf.withColumn(
    "__monotonic_col_value",
    F.when(
        F.col("__monotonic_col") | F.col("__monotonic_col").isNull(), df.value
    ).otherwise(
        F.lag(df.value, 1).over(window)
    ),
)

How to run exponential weighted moving average in pyspark

I am trying to run exponential weighted moving average in PySpark using a Grouped Map Pandas UDF. It doesn't work though:
def ExpMA(myData):
    import pandas as pd
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.functions import PandasUDFType
    from pyspark.sql import SQLContext

    df = myData
    group_col = 'Name'
    sort_col = 'Date'

    schema = df.select(group_col, sort_col, 'count').schema
    print(schema)

    @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
    def ema(pdf):
        Model = pd.DataFrame(pdf.apply(lambda x: x['count'].ewm(span=5, min_periods=1).mean()))
        return Model

    data = df.groupby('Name').apply(ema)
    return data
I also tried running it without the Pandas udf, just writing the ewma equation in PySpark, but the problem there is that the ewma equation contains the lag of the current ewma.
First of all, your Pandas code is incorrect; this just won't work, Spark or not:
pdf.apply(lambda x: x['count'].ewm(span=5, min_periods=1).mean())
Another problem is the output schema which, depending on your data, won't really accommodate the result:
If you want to add the ewm column, the schema should be extended.
If you want to return only the ewm column, the schema is too large.
If you want to just replace the values, the type might not match.
Let's assume this is the first scenario (I allowed myself to rewrite your code a bit):
from pyspark.sql.functions import pandas_udf
from pyspark.sql.functions import PandasUDFType
from pyspark.sql.types import DoubleType, StructField

def exp_ma(df, group_col='Name', sort_col='Date'):
    schema = (df.select(group_col, sort_col, 'count')
                .schema.add(StructField('ewma', DoubleType())))

    @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
    def ema(pdf):
        pdf['ewma'] = pdf['count'].ewm(span=5, min_periods=1).mean()
        return pdf

    return df.groupby(group_col).apply(ema)
df = spark.createDataFrame(
    [("a", 1, 1), ("a", 2, 3), ("a", 3, 3), ("b", 1, 10), ("b", 8, 3), ("b", 9, 0)],
    ("name", "date", "count")
)
exp_ma(df).show()
# +----+----+-----+------------------+
# |Name|Date|count| ewma|
# +----+----+-----+------------------+
# | b| 1| 10| 10.0|
# | b| 8| 3| 5.800000000000001|
# | b| 9| 0|3.0526315789473686|
# | a| 1| 1| 1.0|
# | a| 2| 3| 2.2|
# | a| 3| 3| 2.578947368421052|
# +----+----+-----+------------------+
I don't use Pandas much, so there might be a more elegant way of doing this.
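As a side note, on Spark 3.x the same grouped-map pattern is usually written with applyInPandas instead of the GROUPED_MAP pandas_udf decorator; a hedged sketch reusing the DataFrame above:
from pyspark.sql.types import DoubleType, StructField

def ema_plain(pdf):
    # Plain Python function: add the ewma column to each group's pandas frame.
    pdf['ewma'] = pdf['count'].ewm(span=5, min_periods=1).mean()
    return pdf

out_schema = df.select('name', 'date', 'count').schema.add(StructField('ewma', DoubleType()))
df.groupby('name').applyInPandas(ema_plain, schema=out_schema).show()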

Split spark dataframe by column value and get x number of rows per column value in the result

I have the following spark dataframe, and I am trying to split this up by column value, and return a new dataframe containing x number of rows for each column value
Suppose that this is the dataframe I have:
from pyspark import *
from pyspark.sql import *
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, StructType, StructField, IntegerType, DoubleType
import math

sc = SparkContext.getOrCreate()
spark = SparkSession.builder.master('local').getOrCreate()
schema = StructType([
    StructField("INDEX", IntegerType(), True),
    StructField("SYMBOL", StringType(), True),
    StructField("DATETIMETS", StringType(), True),
    StructField("PRICE", DoubleType(), True),
    StructField("SIZE", IntegerType(), True),
])

df = spark.createDataFrame(
    data=[(0, 'A', '2002-12-01 9:30:20', 19.75, 30200),
          (1, 'A', '2002-12-02 9:31:20', 29.75, 30200),
          (2, 'A', '2004-12-03 10:36:20', 3.0, 30200),
          (3, 'A', '2006-12-06 22:41:20', 24.0, 30200),
          (4, 'A', '2006-12-08 22:42:20', 60.0, 30200),
          (5, 'B', '2002-12-09 9:30:20', 15.75, 30200),
          (6, 'B', '2002-12-12 9:31:20', 49.75, 30200),
          (7, 'C', '2004-11-02 10:36:20', 6.0, 30200),
          (8, 'C', '2007-12-02 22:41:20', 50.0, 30200),
          (9, 'D', '2008-12-02 22:42:20', 60.0, 30200),
          (10, 'E', '2052-12-02 9:30:20', 14.75, 30200),
          (11, 'A', '2062-12-02 9:31:20', 12.75, 30200),
          (12, 'A', '2007-12-02 11:36:20', 5.0, 30200),
          (13, 'A', '2008-12-02 22:41:20', 40.0, 30200),
          (14, 'A', '2008-12-02 22:42:20', 50.0, 30200)],
    schema=schema)
Say I want at most two rows per symbol, i.e. create a new dataframe with the following data.
Is there a way to do this other than looping through each symbol using a 'where' clause?
Here is one option taking the first two rows from each SYMBOL:
(df.rdd
   .groupBy(lambda r: r['SYMBOL'])
   .flatMap(lambda x: list(x[1])[:2])
   .toDF()
   .show())
+-----+------+-------------------+-----+-----+
|INDEX|SYMBOL| DATETIMETS|PRICE| SIZE|
+-----+------+-------------------+-----+-----+
| 0| A| 2002-12-01 9:30:20|19.75|30200|
| 1| A| 2002-12-02 9:31:20|29.75|30200|
| 10| E| 2052-12-02 9:30:20|14.75|30200|
| 9| D|2008-12-02 22:42:20| 60.0|30200|
| 7| C|2004-11-02 10:36:20| 6.0|30200|
| 8| C|2007-12-02 22:41:20| 50.0|30200|
| 5| B| 2002-12-09 9:30:20|15.75|30200|
| 6| B| 2002-12-12 9:31:20|49.75|30200|
+-----+------+-------------------+-----+-----+
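A hedged alternative that stays in the DataFrame API is to rank rows within each SYMBOL with a window function (here ranked by INDEX; swap the orderBy column if a different "first two" is intended):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number the rows within each SYMBOL and keep the first two.
w = Window.partitionBy("SYMBOL").orderBy("INDEX")
(df.withColumn("rn", F.row_number().over(w))
   .filter(F.col("rn") <= 2)
   .drop("rn")
   .show())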

Spark withColumn() performing power functions

I have a data frame df with columns "col1" and "col2". I want to create a third column which uses one of the columns as in an exponent function.
df = df.withColumn("col3", 100**(df("col1")))*df("col2")
However, this always results in:
TypeError: unsupported operand type(s) for ** or pow(): 'float' and 'Column'
I understand that this is due to the function taking df("col1") as a "Column" instead of the item at that row.
If I perform
results = df.map(lambda x : 100**(df("col2"))*df("col2"))
this works, but I can't append to my original data frame.
Any thoughts?
This is my first time posting, so I apologize for any formatting problems.
Since Spark 1.4 you can use the pow function as follows:
from pyspark.sql import Row
from pyspark.sql.functions import pow, col
row = Row("col1", "col2")
df = sc.parallelize([row(1, 2), row(2, 3), row(3, 3)]).toDF()
df.select("*", pow(col("col1"), col("col2")).alias("pow")).show()
## +----+----+----+
## |col1|col2| pow|
## +----+----+----+
## | 1| 2| 1.0|
## | 2| 3| 8.0|
## | 3| 3|27.0|
## +----+----+----+
If you use an older version, a Python UDF should do the trick:
import math
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
my_pow = udf(lambda x, y: math.pow(x, y), DoubleType())
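For completeness, the UDF defined above can then be applied the same way as the built-in version (a short usage sketch, assuming the same df and the col import from earlier):
df.select("*", my_pow(col("col1"), col("col2")).alias("pow")).show()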
Just to complement the accepted answer: one can now do something very similar to what the OP tried to do, i.e., use the ** operator, or even Python's builtin pow:
from pyspark.sql import SparkSession
from pyspark.sql.functions import pow as pow_
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, ), (2, ), (3, ), (4, ), (5, ), (6, )], 'n: int')
df = df.withColumn('pyspark_pow', pow_(df['n'], df['n'])) \
.withColumn('python_pow', pow(df['n'], df['n'])) \
.withColumn('double_star_operator', df['n'] ** df['n'])
df.show()
+---+-----------+----------+--------------------+
| n|pyspark_pow|python_pow|double_star_operator|
+---+-----------+----------+--------------------+
| 1| 1.0| 1.0| 1.0|
| 2| 4.0| 4.0| 4.0|
| 3| 27.0| 27.0| 27.0|
| 4| 256.0| 256.0| 256.0|
| 5| 3125.0| 3125.0| 3125.0|
| 6| 46656.0| 46656.0| 46656.0|
+---+-----------+----------+--------------------+
As one can see, both PySpark's and Python's pow return the same result, as well as the ** operator. It also works when one of the arguments is a scalar:
df = df.withColumn('pyspark_pow', pow_(2, df['n'])) \
.withColumn('python_pow', pow(2, df['n'])) \
.withColumn('double_star_operator', 2 ** df['n'])
df.show()
+---+-----------+----------+--------------------+
| n|pyspark_pow|python_pow|double_star_operator|
+---+-----------+----------+--------------------+
| 1| 2.0| 2.0| 2.0|
| 2| 4.0| 4.0| 4.0|
| 3| 8.0| 8.0| 8.0|
| 4| 16.0| 16.0| 16.0|
| 5| 32.0| 32.0| 32.0|
| 6| 64.0| 64.0| 64.0|
+---+-----------+----------+--------------------+
I believe the reason Python's pow now works on PySpark columns is that pow is equivalent to the ** operator when used with only two arguments (see docs, here), and the ** operator uses the object's own implementation of the power operation if it is defined for the object being operated on (see this SO response here).
Apparently, PySpark's Column has the proper definition for the __pow__ operator (see the source code for Column).
I am not sure why the ** operator did not work originally, but I am assuming it is related to the fact that - at the time - Column was defined differently.
The stack used for testing was Python 3.8.5 and PySpark 3.1.1, but I have seen this behavior for PySpark >= 2.4 as well.
