This question already has answers here:
Retrieve top n in each group of a DataFrame in pyspark
(6 answers)
Closed 4 years ago.
I have a dataframe grouped by 'id' and 'type':
+---+----+-----+
| id|type|count|
+---+----+-----+
| 0| A| 2|
| 0| B| 3|
| 0| C| 1|
| 0| D| 3|
| 0| G| 1|
| 1| A| 0|
| 1| C| 1|
| 1| D| 1|
| 1| G| 2|
+---+----+-----+
I would like now to group by 'id' and get a sum of 3 largest values:
+---+-----+
| id|count|
+---+-----+
| 0| 8|
| 1| 4|
+---+-----+
How can I do it in pyspark, so that the computation is relatively efficient?
Found solution here
You can use the following code to perform this
from pyspark.sql.functions import *
from pyspark.sql.window import Window
df = spark.createDataFrame([
    (0, "A", 2),
    (0, "B", 3),
    (0, "C", 1),
    (0, "D", 3),
    (1, "A", 0),
    (1, "C", 1),
    (1, "D", 1),
    (1, "G", 2)
], ("id", "type", "count"))
my_window = Window.partitionBy("id").orderBy("count")
df.withColumn("last_3", lead("count").over(my_window)).groupBy("id").agg(sum("last_3")).show()
Output:
+---+-----------+
| id|sum(last_3)|
+---+-----------+
| 0| 8|
| 1| 4|
+---+-----------+
Details: the Window partitions your data by id and orders it by count. The lead function then uses this window to return the next value within the group, which becomes the new last_3 column. So (0, C, 1) is the lowest tuple in the group for id=0 and receives the value 2, since that is the next highest count in this group (from the tuple (0, A, 2)), and so on. The highest tuple has no following value and is assigned null, which means the smallest count in each group is effectively dropped from the sum. Finally, perform the group operation and the sum.
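For groups that may hold more than four rows, a more literal "sum of the three largest" variant (a sketch, assuming the same df as above) is to rank the counts per id with row_number and keep only the top three before summing:
# A sketch of a more literal top-3 sum, assuming the same df as above.
from pyspark.sql import Window
from pyspark.sql import functions as F

w_top = Window.partitionBy("id").orderBy(F.col("count").desc())

(df.withColumn("rn", F.row_number().over(w_top))  # rank counts per id, largest first
   .filter(F.col("rn") <= 3)                      # keep only the three largest
   .groupBy("id")
   .agg(F.sum("count").alias("sum_top3"))
   .show())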
Related
I have the following pyspark dataframe:
import pandas as pd
foo = pd.DataFrame({'id': [1,1,1,1,1, 2,2,2,2,2],
                    'time': [1,2,3,4,5, 1,2,3,4,5],
                    'value': ['a','a','a','b','b', 'b','b','c','c','c']})
foo_df = spark.createDataFrame(foo)
foo_df.show()
+---+----+-----+
| id|time|value|
+---+----+-----+
| 1| 1| a|
| 1| 2| a|
| 1| 3| a|
| 1| 4| b|
| 1| 5| b|
| 2| 1| b|
| 2| 2| b|
| 2| 3| c|
| 2| 4| c|
| 2| 5| c|
+---+----+-----+
For a rolling time window of 3 rows, I would like to calculate the percentage of appearances of each of the values in the value column. The operation should be done per id.
The output dataframe would look something like this:
+---+------------------+------------------+------------------+
| id| perc_a| perc_b| perc_c|
+---+------------------+------------------+------------------+
| 1| 1.0| 0.0| 0.0|
| 1|0.6666666666666666|0.3333333333333333| 0.0|
| 1|0.3333333333333333|0.6666666666666666| 0.0|
| 2| 0.0|0.6666666666666666|0.3333333333333333|
| 2| 0.0|0.3333333333333333|0.6666666666666666|
| 2| 0.0| 0.0| 1.0|
+---+------------------+------------------+------------------+
Explanation of result:
for id=1 and the first window (time=[1,2,3]), the value column contains only a's, so perc_a equals 100 and the rest are 0.
for id=1 and the second window (time=[2,3,4]), the value column contains 2 a's and 1 b, so perc_a equals 66.6, perc_b is 33.3 and perc_c equals 0.
etc.
How could I achieve that in pyspark ?
EDIT
I am using pyspark 2.4
You can use count with a window function.
import pyspark.sql.functions as F
from pyspark.sql import Window

# Rolling window: the current row plus the next 2 rows, per id.
w = Window.partitionBy('id').orderBy('time').rowsBetween(Window.currentRow, 2)

df = (df.select('id', F.col('time').alias('window'),
                *[(F.count(F.when(F.col('value') == x, 'value')).over(w)
                   / F.count('value').over(w) * 100).alias(f'perc_{x}')
                  for x in ['a', 'b', 'c']])
        .filter(F.col('time') < 4))  # drop the incomplete trailing windows
Clever answer by @Emma. Expanding on it with a SparkSQL implementation.
The approach is to collect the values over the intended sliding row range, i.e. ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING, filter on window_map < 4 (which here is the same as time < 4) so that only full three-row windows remain, then explode the collected list to count the individual frequencies, and finally pivot to the intended format.
SparkSQL - Collect List
import pandas as pd

# `sql` below refers to the active SparkSession (or SQLContext)
foo = pd.DataFrame({'id': [1,1,1,1,1, 2,2,2,2,2],
                    'time': [1,2,3,4,5, 1,2,3,4,5],
                    'value': ['a','a','a','b','b', 'b','b','c','c','c']})
sparkDF = sql.createDataFrame(foo)
sparkDF.registerTempTable("INPUT")
sql.sql("""
SELECT
id,
time,
value,
ROW_NUMBER() OVER(PARTITION BY id ORDER BY time
) as window_map,
COLLECT_LIST(value) OVER(PARTITION BY id ORDER BY time
ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING
) as collected_list
FROM INPUT
""").show()
+---+----+-----+----------+--------------+
| id|time|value|window_map|collected_list|
+---+----+-----+----------+--------------+
| 1| 1| a| 1| [a, a, a]|
| 1| 2| a| 2| [a, a, b]|
| 1| 3| a| 3| [a, b, b]|
| 1| 4| b| 4| [b, b]|
| 1| 5| b| 5| [b]|
| 2| 1| b| 1| [b, b, c]|
| 2| 2| b| 2| [b, c, c]|
| 2| 3| c| 3| [c, c, c]|
| 2| 4| c| 4| [c, c]|
| 2| 5| c| 5| [c]|
+---+----+-----+----------+--------------+
SparkSQL - Explode - Frequency Calculation
immDF = sql.sql(
    """
    SELECT
        id,
        time,
        exploded_value,
        COUNT(*) as value_count
    FROM (
        SELECT
            id,
            time,
            value,
            window_map,
            EXPLODE(collected_list) as exploded_value
        FROM (
            SELECT
                id,
                time,
                value,
                ROW_NUMBER() OVER(PARTITION BY id ORDER BY time) as window_map,
                COLLECT_LIST(value) OVER(PARTITION BY id ORDER BY time
                                         ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING) as collected_list
            FROM INPUT
        )
        WHERE window_map < 4 -- keep only the full 3-row windows (window_map <= 3)
    )
    GROUP BY 1,2,3
    ORDER BY id, time
    """
)
immDF.registerTempTable("IMM_RESULT")
immDF.show()
+---+----+--------------+-----------+
| id|time|exploded_value|value_count|
+---+----+--------------+-----------+
| 1| 1| a| 3|
| 1| 2| b| 1|
| 1| 2| a| 2|
| 1| 3| a| 1|
| 1| 3| b| 2|
| 2| 1| b| 2|
| 2| 1| c| 1|
| 2| 2| b| 1|
| 2| 2| c| 2|
| 2| 3| c| 3|
+---+----+--------------+-----------+
SparkSQL - Pivot
sql.sql("""
SELECT
id,
time,
ROUND(NVL(a,0),2) as perc_a,
ROUND(NVL(b,0),2) as perc_b,
ROUND(NVL(c,0),2) as perc_c
FROM IMM_RESULT
PIVOT (
MAX(value_count)/3 * 100.0
FOR exploded_value IN ('a'
,'b'
,'c'
)
)
""").show()
+---+----+------+------+------+
| id|time|perc_a|perc_b|perc_c|
+---+----+------+------+------+
| 1| 1| 100.0| 0.0| 0.0|
| 1| 2| 66.67| 33.33| 0.0|
| 1| 3| 33.33| 66.67| 0.0|
| 2| 1| 0.0| 66.67| 33.33|
| 2| 2| 0.0| 33.33| 66.67|
| 2| 3| 0.0| 0.0| 100.0|
+---+----+------+------+------+
I have a dataframe that looks like this
+-----------+-----------+-----------+
|salesperson| device|amount_sold|
+-----------+-----------+-----------+
| john| notebook| 2|
| gary| notebook| 3|
| john|small_phone| 2|
| mary|small_phone| 3|
| john|large_phone| 3|
| john| camera| 3|
+-----------+-----------+-----------+
and I have transformed it using the pivot function to this, with a Total column:
+-----------+------+-----------+--------+-----------+-----+
|salesperson|camera|large_phone|notebook|small_phone|Total|
+-----------+------+-----------+--------+-----------+-----+
| gary| 0| 0| 3| 0| 3|
| mary| 0| 0| 0| 3| 3|
| john| 3| 3| 2| 2| 10|
+-----------+------+-----------+--------+-----------+-----+
but I would like a dataframe with a row (Total) that would also contain a total for every column like below:
+-----------+------+-----------+--------+-----------+-----+
|salesperson|camera|large_phone|notebook|small_phone|Total|
+-----------+------+-----------+--------+-----------+-----+
| gary| 0| 0| 3| 0| 3|
| mary| 0| 0| 0| 3| 3|
| john| 3| 3| 2| 2| 10|
| Total| 3| 3| 5| 5| 16|
+-----------+------+-----------+--------+-----------+-----+
Is it possible to do this in Spark using Scala/Python (preferably Scala), and without using union if possible?
TIA
You can do something like below:
import org.apache.spark.sql.functions.{col, lit, sum}

// Drop the leading "salesperson" column; the remaining columns are the value columns.
val columns = df.columns.dropWhile(_ == "salesperson").map(col)

// Use function `sum` on each column and union the result with the original DataFrame.
val withTotalAsRow = df.union(df.select(lit("Total").as("salesperson") +: columns.map(sum): _*))

// If the Total column does not already exist in the DataFrame,
// append it by adding up the value from each column.
val withTotalAsColumn = withTotalAsRow.withColumn("Total", columns.reduce(_ plus _))
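For reference, a rough PySpark sketch of the same union-based idea (the question allows Python as well); it assumes the already-pivoted DataFrame from the question, including its Total column, is available as pivoted:
# A sketch only: `pivoted` is assumed to be the already-pivoted DataFrame
# from the question, including its Total column.
from pyspark.sql import functions as F

value_cols = [c for c in pivoted.columns if c != "salesperson"]

# Aggregate every value column into a single Total row, then union it back.
total_row = pivoted.select(
    F.lit("Total").alias("salesperson"),
    *[F.sum(c).alias(c) for c in value_cols]
)
result = pivoted.unionByName(total_row)
result.show()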
With Spark Scala, you can compute the Total column with the following snippet of code.
// Assuming the spark session is available as a variable named 'spark'
import spark.implicits._

// Row-wise total: add the pivoted columns together
val resultDF = df.withColumn("Total", $"camera" + $"large_phone" + $"notebook" + $"small_phone")
I have a PySpark data table that looks like the following
shouldMerge | number
true | 1
true | 1
true | 2
false | 3
false | 1
I want to combine all of the rows with shouldMerge as true and add up the numbers.
so the final output would look like
shouldMerge | number
true | 4
false | 3
false | 1
How can I select all the ones with shouldMerge == true, add up the numbers, and generate a new row in PySpark?
Edit: an alternate, slightly more complicated scenario closer to what I'm trying to solve, where we only aggregate rows with positive mergeIds:
mergeId | number
1 | 1
2 | 1
1 | 2
-1 | 3
-1 | 1
mergeId | number
1 | 3
2 | 1
-1 | 3
-1 | 1
IIUC, you want to do a groupBy but only on the positive mergeIds.
One way is to filter your DataFrame for the positive ids, group, aggregate, and union this back with the negative ids (similar to @shanmuga's answer).
The other way would be to use when to dynamically create a grouping key. If the mergeId is positive, use the mergeId to group. Otherwise, use a monotonically_increasing_id to ensure that the row does not get aggregated.
Here is an example:
import pyspark.sql.functions as f
df.withColumn("uid", f.monotonically_increasing_id())\
.groupBy(
f.when(
f.col("mergeId") > 0,
f.col("mergeId")
).otherwise(f.col("uid")).alias("mergeKey"),
f.col("mergeId")
)\
.agg(f.sum("number").alias("number"))\
.drop("mergeKey")\
.show()
#+-------+------+
#|mergeId|number|
#+-------+------+
#| -1| 1.0|
#| 1| 3.0|
#| 2| 1.0|
#| -1| 3.0|
#+-------+------+
This can easily be generalized by changing the when condition (in this case it's f.col("mergeId") > 0) to match your specific requirements.
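For example, a hypothetical variation that only merges rows whose mergeId appears in a given list would just swap the condition:
# Hypothetical variation: only merge rows whose mergeId is in `ids_to_merge`.
import pyspark.sql.functions as f

ids_to_merge = [1, 2]
merge_key = f.when(
    f.col("mergeId").isin(ids_to_merge),
    f.col("mergeId")
).otherwise(f.col("uid"))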
Explanation:
First we create a temporary column uid which is a unique ID for each row. Next, we call groupBy and if the mergeId is positive use the mergeId to group. Otherwise we use the uid as the mergeKey. I also passed in the mergeId as a second group by column as a way to keep that column for the output.
To demonstrate what is going on, take a look at the intermediate result:
df.withColumn("uid", f.monotonically_increasing_id())\
.withColumn(
"mergeKey",
f.when(
f.col("mergeId") > 0,
f.col("mergeId")
).otherwise(f.col("uid")).alias("mergeKey")
)\
.show()
#+-------+------+-----------+-----------+
#|mergeId|number| uid| mergeKey|
#+-------+------+-----------+-----------+
#| 1| 1| 0| 1|
#| 2| 1| 8589934592| 2|
#| 1| 2|17179869184| 1|
#| -1| 3|25769803776|25769803776|
#| -1| 1|25769803777|25769803777|
#+-------+------+-----------+-----------+
As you can see, the mergeKey remains the unique value for the negative mergeIds.
From this intermediate step, the desired result is just a trivial group by and sum, followed by dropping the mergeKey column.
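A sketch of that final step, assuming the intermediate DataFrame shown above (with its mergeKey column) is named intermediate:
# A sketch only: `intermediate` is the DataFrame from the intermediate step above.
import pyspark.sql.functions as f

(intermediate.groupBy("mergeKey", "mergeId")
    .agg(f.sum("number").alias("number"))
    .drop("mergeKey")
    .show())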
You will have to filter out only the rows where shouldMerge is true and aggregate, then union this with all the remaining rows.
import pyspark.sql.functions as functions

# First scenario: boolean shouldMerge flag
df = sqlContext.createDataFrame([
    (True, 1),
    (True, 1),
    (True, 2),
    (False, 3),
    (False, 1),
], ("shouldMerge", "number"))

false_df = df.filter("shouldMerge = false")
true_df = df.filter("shouldMerge = true")
result = true_df.groupBy("shouldMerge")\
    .agg(functions.sum("number").alias("number"))\
    .unionAll(false_df)

# Second scenario: integer mergeId, only the positive ids get merged
df = sqlContext.createDataFrame([
    (1, 1),
    (2, 1),
    (1, 2),
    (-1, 3),
    (-1, 1),
], ("mergeId", "number"))

merge_condition = df["mergeId"] > -1
remaining = ~merge_condition
groupby_field = "mergeId"
false_df = df.filter(remaining)
true_df = df.filter(merge_condition)
result = true_df.groupBy(groupby_field)\
    .agg(functions.sum("number").alias("number"))\
    .unionAll(false_df)
result.show()
The first problem posted by the OP.
# Create the DataFrame
valuesCol = [(True,1),(True,1),(True,2),(False,3),(False,1)]
df = sqlContext.createDataFrame(valuesCol,['shouldMerge','number'])
df.show()
+-----------+------+
|shouldMerge|number|
+-----------+------+
| true| 1|
| true| 1|
| true| 2|
| false| 3|
| false| 1|
+-----------+------+
# Packages to be imported
from pyspark.sql.window import Window
from pyspark.sql.functions import when, col, lag
# Register the dataframe as a view
df.registerTempTable('table_view')
df=sqlContext.sql(
'select shouldMerge, number, sum(number) over (partition by shouldMerge) as sum_number from table_view'
)
df = df.withColumn('number',when(col('shouldMerge')==True,col('sum_number')).otherwise(col('number')))
df.show()
+-----------+------+----------+
|shouldMerge|number|sum_number|
+-----------+------+----------+
| true| 4| 4|
| true| 4| 4|
| true| 4| 4|
| false| 3| 4|
| false| 1| 4|
+-----------+------+----------+
df = df.drop('sum_number')
my_window = Window.partitionBy().orderBy('shouldMerge')
df = df.withColumn('shouldMerge_lag', lag(col('shouldMerge'),1).over(my_window))
df.show()
+-----------+------+---------------+
|shouldMerge|number|shouldMerge_lag|
+-----------+------+---------------+
| false| 3| null|
| false| 1| false|
| true| 4| false|
| true| 4| true|
| true| 4| true|
+-----------+------+---------------+
df = df.where(~((col('shouldMerge')==True) & (col('shouldMerge_lag')==True))).drop('shouldMerge_lag')
df.show()
+-----------+------+
|shouldMerge|number|
+-----------+------+
| false| 3|
| false| 1|
| true| 4|
+-----------+------+
For the second problem posted by the OP
# Create the DataFrame
valuesCol = [(1,2),(1,1),(2,1),(1,2),(-1,3),(-1,1)]
df = sqlContext.createDataFrame(valuesCol,['mergeId','number'])
df.show()
+-------+------+
|mergeId|number|
+-------+------+
| 1| 2|
| 1| 1|
| 2| 1|
| 1| 2|
| -1| 3|
| -1| 1|
+-------+------+
# Packages to be imported
from pyspark.sql.window import Window
from pyspark.sql.functions import when, col, lag
# Register the dataframe as a view
df.registerTempTable('table_view')
df=sqlContext.sql(
'select mergeId, number, sum(number) over (partition by mergeId) as sum_number from table_view'
)
df = df.withColumn('number',when(col('mergeId') > 0,col('sum_number')).otherwise(col('number')))
df.show()
+-------+------+----------+
|mergeId|number|sum_number|
+-------+------+----------+
| 1| 5| 5|
| 1| 5| 5|
| 1| 5| 5|
| 2| 1| 1|
| -1| 3| 4|
| -1| 1| 4|
+-------+------+----------+
df = df.drop('sum_number')
my_window = Window.partitionBy('mergeId').orderBy('mergeId')
df = df.withColumn('mergeId_lag', lag(col('mergeId'),1).over(my_window))
df.show()
+-------+------+-----------+
|mergeId|number|mergeId_lag|
+-------+------+-----------+
| 1| 5| null|
| 1| 5| 1|
| 1| 5| 1|
| 2| 1| null|
| -1| 3| null|
| -1| 1| -1|
+-------+------+-----------+
df = df.where(~((col('mergeId') > 0) & (col('mergeId_lag').isNotNull()))).drop('mergeId_lag')
df.show()
+-------+------+
|mergeId|number|
+-------+------+
| 1| 5|
| 2| 1|
| -1| 3|
| -1| 1|
+-------+------+
Documentation: lag(col, offset) - returns the value that is offset rows before the current row, and null if there are fewer than offset rows before the current row.
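Below, a minimal self-contained illustration of that behaviour (a sketch, separate from the DataFrames above, reusing the same sqlContext): the first row in the ordering has no preceding row, so lag returns null there.
# A sketch only: a tiny DataFrame to show how lag shifts values and yields
# null where no preceding row exists.
from pyspark.sql import Window
from pyspark.sql.functions import lag, col

demo = sqlContext.createDataFrame([(1, 'a'), (2, 'b'), (3, 'c')], ['t', 'v'])
w_demo = Window.orderBy('t')
demo.withColumn('prev_v', lag(col('v'), 1).over(w_demo)).show()
# prev_v comes out as: null, a, b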
I have the following dataset and working with PySpark
df = sparkSession.createDataFrame([(5, 'Samsung', '2018-02-23'),
                                   (8, 'Apple', '2018-02-22'),
                                   (5, 'Sony', '2018-02-21'),
                                   (5, 'Samsung', '2018-02-20'),
                                   (8, 'LG', '2018-02-20')],
                                  ['ID', 'Product', 'Date']
                                  )
+---+-------+----------+
| ID|Product| Date|
+---+-------+----------+
| 5|Samsung|2018-02-23|
| 8| Apple|2018-02-22|
| 5| Sony|2018-02-21|
| 5|Samsung|2018-02-20|
| 8| LG|2018-02-20|
+---+-------+----------+
# Each ID will appear ALWAYS at least 2 times (do not consider the case of unique IDs in this df)
Each ID should increment the Product counter only for the product it chose most frequently.
In case of equal frequency, the most recent Date should decide which product receives the +1.
From the sample above, the desired output would be:
+-------+-------+
|Product|Counter|
+-------+-------+
|Samsung| 1|
| Apple| 1|
| Sony| 0|
| LG| 0|
+-------+-------+
# Samsung - 1 (preferred twice by ID=5)
# Apple - 1 (preferred by ID=8 more recently than LG)
# Sony - 0 (because ID=5 preferred Samsung 2 time, and Sony only 1)
# LG - 0 (because ID=8 preferred Apple more recently)
What is the most efficient way with PySpark to achieve this result?
IIUC, you want to pick the most frequent product for each ID, breaking ties using the most recent Date.
So first, we can get the count for each product/ID pair using:
import pyspark.sql.functions as f
from pyspark.sql import Window
df = df.select(
    'ID',
    'Product',
    'Date',
    f.count('Product').over(Window.partitionBy('ID', 'Product')).alias('count')
)
df.show()
#+---+-------+----------+-----+
#| ID|Product| Date|count|
#+---+-------+----------+-----+
#| 5| Sony|2018-02-21| 1|
#| 8| LG|2018-02-20| 1|
#| 8| Apple|2018-02-22| 1|
#| 5|Samsung|2018-02-23| 2|
#| 5|Samsung|2018-02-20| 2|
#+---+-------+----------+-----+
Now you can use a Window to rank each product for each ID. We can use pyspark.sql.functions.desc() to sort by count and Date descending. If the row_number() is equal to 1, that means that row is first.
w = Window.partitionBy('ID').orderBy(f.desc('count'), f.desc('Date'))
df = df.select(
    'Product',
    (f.row_number().over(w) == 1).cast("int").alias('Counter')
)
df.show()
#+-------+-------+
#|Product|Counter|
#+-------+-------+
#|Samsung| 1|
#|Samsung| 0|
#| Sony| 0|
#| Apple| 1|
#| LG| 0|
#+-------+-------+
Finally, groupBy() the Product and pick the maximum value of Counter:
df.groupBy('Product').agg(f.max('Counter').alias('Counter')).show()
#+-------+-------+
#|Product|Counter|
#+-------+-------+
#| Sony| 0|
#|Samsung| 1|
#| LG| 0|
#| Apple| 1|
#+-------+-------+
Update
Here's a little bit of a simpler way:
w = Window.partitionBy('ID').orderBy(f.desc('count'), f.desc('Date'))
df.groupBy('ID', 'Product')\
    .agg(f.max('Date').alias('Date'), f.count('Product').alias('count'))\
    .select('Product', (f.row_number().over(w) == 1).cast("int").alias('Counter'))\
    .show()
#+-------+-------+
#|Product|Counter|
#+-------+-------+
#|Samsung| 1|
#| Sony| 0|
#| Apple| 1|
#| LG| 0|
#+-------+-------+
I have a dataframe in pyspark.
Say it has some columns a, b, c...
I want to group the data into groups as the value of column A changes. Say:
A B
1 x
1 y
0 x
0 y
0 x
1 y
1 x
1 y
There will be 3 groups: (1x,1y), (0x,0y,0x), (1y,1x,1y), along with the corresponding row data.
If I understand correctly you want to create a distinct group every time column A changes values.
First we'll create a monotonically increasing id to keep the row order as it is:
import pyspark.sql.functions as psf
df = sc.parallelize([[1,'x'],[1,'y'],[0,'x'],[0,'y'],[0,'x'],[1,'y'],[1,'x'],[1,'y']])\
    .toDF(['A', 'B'])\
    .withColumn("rn", psf.monotonically_increasing_id())
df.show()
+---+---+----------+
| A| B| rn|
+---+---+----------+
| 1| x| 0|
| 1| y| 1|
| 0| x| 2|
| 0| y| 3|
| 0| x|8589934592|
| 1| y|8589934593|
| 1| x|8589934594|
| 1| y|8589934595|
+---+---+----------+
Now we'll use a window function to create a column that contains 1 every time column A changes:
from pyspark.sql import Window
w = Window.orderBy('rn')
df = df.withColumn("changed", (df.A != psf.lag('A', 1, 0).over(w)).cast('int'))
+---+---+----------+-------+
| A| B| rn|changed|
+---+---+----------+-------+
| 1| x| 0| 1|
| 1| y| 1| 0|
| 0| x| 2| 1|
| 0| y| 3| 0|
| 0| x|8589934592| 0|
| 1| y|8589934593| 1|
| 1| x|8589934594| 0|
| 1| y|8589934595| 0|
+---+---+----------+-------+
Finally we'll use another window function to allocate different numbers to each group:
df = df.withColumn("group_id", psf.sum("changed").over(w)).drop("rn").drop("changed")
+---+---+--------+
| A| B|group_id|
+---+---+--------+
| 1| x| 1|
| 1| y| 1|
| 0| x| 2|
| 0| y| 2|
| 0| x| 2|
| 1| y| 3|
| 1| x| 3|
| 1| y| 3|
+---+---+--------+
Now you can build your groups.
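For instance, a minimal sketch of materializing the rows of each group using the group_id computed above:
# A sketch only: collect the rows belonging to each group_id.
groups = df.groupBy("group_id").agg(
    psf.collect_list(psf.struct("A", "B")).alias("rows")
)
groups.orderBy("group_id").show(truncate=False)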