Labelling duplicates in PySpark


I am trying to label the duplicates in my PySpark DataFrame by group while keeping every row of the DataFrame. Below is example code.
data= [
("A", "2018-01-03"),
("A", "2018-01-03"),
("A", "2018-01-03"),
("B", "2019-01-03"),
("B", "2019-01-03"),
("B", "2019-01-03"),
("C", "2020-01-03"),
("C", "2020-01-03"),
("C", "2020-01-03"),
]
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
spark= SparkSession.builder.getOrCreate()
df= spark.createDataFrame(data=data, schema=["Group", "Date"])
df= df.withColumn("Date", F.to_date("Date", "yyyy-MM-dd"))
from pyspark.sql import Window
windowSpec= Window.partitionBy("Group").orderBy(F.asc("Date"))
df.withColumn("group_number", F.dense_rank().over(windowSpec)).orderBy("Date").show()
This is my current output. It is correct, since the code ranks "Date" within each group, but it is not the outcome I wanted.
+-----+----------+------------+
|Group|      Date|group_number|
+-----+----------+------------+
|    A|2018-01-03|           1|
|    A|2018-01-03|           1|
|    A|2018-01-03|           1|
|    B|2019-01-03|           1|
|    B|2019-01-03|           1|
|    B|2019-01-03|           1|
|    C|2020-01-03|           1|
|    C|2020-01-03|           1|
|    C|2020-01-03|           1|
+-----+----------+------------+
I was hoping for output like this:
+-----+----------+------------+
|Group|      Date|group_number|
+-----+----------+------------+
|    A|2018-01-03|           1|
|    A|2018-01-03|           1|
|    A|2018-01-03|           1|
|    B|2019-01-03|           2|
|    B|2019-01-03|           2|
|    B|2019-01-03|           2|
|    C|2020-01-03|           3|
|    C|2020-01-03|           3|
|    C|2020-01-03|           3|
+-----+----------+------------+
Any suggestions? I have found this post but this is just a binary solution! I have more than 2 groups in my dataset.

You don't need partitionBy when you declare your windowSpec. By specifying the column "Group" in partitionBy, you're telling Spark to compute dense_rank() within each partition, ordered by "Date". So the output is correct. In group A all rows have the same date, so they all get a group_number of 1. The same holds for group B: all rows share one date, so they also get a group_number of 1.
So a quick fix for your problem is to remove the partitionBy from your windowSpec.
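To see why the partitioned window gives 1 everywhere, here is a minimal plain-Python sketch of the dense-rank idea. It does not use Spark; the dense_rank helper below is a hypothetical stand-in written for illustration, and it ranks by (Group, Date) in the unpartitioned case.

```python
from itertools import groupby

rows = (
    [("A", "2018-01-03")] * 3
    + [("B", "2019-01-03")] * 3
    + [("C", "2020-01-03")] * 3
)


def dense_rank(values):
    """Map each value to its 1-based position among the distinct sorted values."""
    order = {v: i + 1 for i, v in enumerate(sorted(set(values)))}
    return [order[v] for v in values]


# Partitioned by Group: the ranking restarts inside every group,
# and each group holds a single distinct date, so every row gets 1.
partitioned = []
for _, part in groupby(sorted(rows), key=lambda r: r[0]):
    dates = [date for _, date in part]
    partitioned.extend(dense_rank(dates))

# No partition: one ranking across all rows, ordered by (Group, Date).
unpartitioned = dense_rank(sorted(rows))

print(partitioned)    # [1, 1, 1, 1, 1, 1, 1, 1, 1]
print(unpartitioned)  # [1, 1, 1, 2, 2, 2, 3, 3, 3]
```

The second list matches the output the question asks for.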
EDIT: If you want to number the rows by the Group column instead, another solution is to pass a user-defined function (UDF) as the second argument of df.withColumn(). In the UDF you specify the input and output like in a normal function. Something like this:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
def new_column(group):
    # ord("A") is 65, so "A" -> 1, "B" -> 2, ... (assumes single uppercase letters)
    return ord(group) - 64
funct = udf(new_column, IntegerType())
df.withColumn("group_number", funct(df["Group"])).orderBy("Date").show()
If you were to use a UDF for the Date, you would need some way to keep track of the dates already seen. An example:
import datetime
date_dict = {}
def new_column(date_obj):
    key = date_obj.strftime("%Y-%m-%d")
    # Assign the next number the first time a date is seen.
    # Note: date_dict lives in each Python worker process, so the numbering
    # is only consistent if all rows are handled by the same worker.
    if key not in date_dict:
        date_dict[key] = len(date_dict) + 1
    return date_dict[key]

What you want is to rank across all the groups, not within each group, so you don't need to partition the Window. Ordering by Group and Date gives the desired output:
windowSpec = Window.orderBy(F.asc("Group"), F.asc("Date"))
df.withColumn("group_number", F.dense_rank().over(windowSpec)).orderBy("Date").show()
#+-----+----------+------------+
#|Group|      Date|group_number|
#+-----+----------+------------+
#|    A|2018-01-03|           1|
#|    A|2018-01-03|           1|
#|    A|2018-01-03|           1|
#|    B|2019-01-03|           2|
#|    B|2019-01-03|           2|
#|    B|2019-01-03|           2|
#|    C|2020-01-03|           3|
#|    C|2020-01-03|           3|
#|    C|2020-01-03|           3|
#+-----+----------+------------+
And you surely don't need any UDF as the other answer suggests.

Related

PySpark: Pass value as suffix to dataframe name

I have a PySpark dataframe df and want to add an "iteration suffix". For every iteration, counter should be raised by 1 and added as suffix to the dataframe name. For test purposes, my code looks like this:
counter = 1
def loop:
counter = counter + 1
df_%s = df.select('A','B') % counter
2 problems here: I don't know how to set up the counter variable as this version runs into an error ('local variable 'counter' referenced before assignment') and I don't know how to correctly pass the current counter value to the dataframe name. Thanks for your help!

Given the following dataframe
+---+------+-----+
| A| B| C|
+---+------+-----+
| 1| Red| 5.52|
| 2| Blue| 1.99|
| 3| Green| 3.71|
| 4|Purple|12.09|
+---+------+-----+
You can get the result with the following
for i in range(10):
    globals()['df_{}'.format(i)] = df.select("A", "B")
Now you have 10 different dataframes (df_0 to df_9) to operate on
from pyspark.sql import functions
df_1 = df_1.withColumn("test", functions.lit(1))
df_1.show()
+---+------+----+
|  A|     B|test|
+---+------+----+
|  1|   Red|   1|
|  2|  Blue|   1|
|  3| Green|   1|
|  4|Purple|   1|
+---+------+----+
df_2.show()
+---+------+
| A| B|
+---+------+
| 1| Red|
| 2| Blue|
| 3| Green|
| 4|Purple|
+---+------+
#and so on..
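As an alternative to writing into globals(), a dictionary keyed by the generated name keeps the "suffixed" frames in one place and sidesteps the counter scoping error from the question, because the loop variable is the counter. This is only a sketch: plain lists of tuples stand in for DataFrames so it runs without Spark, and select_a_b is a hypothetical helper standing in for df.select('A', 'B').

```python
# Stand-in data: each tuple is a row with columns (A, B, C).
df = [(1, "Red", 5.52), (2, "Blue", 1.99), (3, "Green", 3.71), (4, "Purple", 12.09)]


def select_a_b(rows):
    """Hypothetical stand-in for df.select('A', 'B')."""
    return [(a, b) for a, b, _ in rows]


frames = {}
for counter in range(1, 4):
    # The name is just a dictionary key, so any suffix can be built with format().
    frames["df_{}".format(counter)] = select_a_b(df)

print(sorted(frames))     # ['df_1', 'df_2', 'df_3']
print(frames["df_2"][0])  # (1, 'Red')
```

With real DataFrames the loop body would store df.select('A', 'B') instead, and frames["df_2"] would be used wherever df_2 appears above.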

Spark create a row containing a sum for every column (like a grand total for every column)

I have a dataframe that looks like this
+-----------+-----------+-----------+
|salesperson| device|amount_sold|
+-----------+-----------+-----------+
| john| notebook| 2|
| gary| notebook| 3|
| john|small_phone| 2|
| mary|small_phone| 3|
| john|large_phone| 3|
| john| camera| 3|
+-----------+-----------+-----------+
and I have transformed it using pivot function to this with a Total column
+-----------+------+-----------+--------+-----------+-----+
|salesperson|camera|large_phone|notebook|small_phone|Total|
+-----------+------+-----------+--------+-----------+-----+
| gary| 0| 0| 3| 0| 3|
| mary| 0| 0| 0| 3| 3|
| john| 3| 3| 2| 2| 10|
+-----------+------+-----------+--------+-----------+-----+
but I would like a dataframe with a row (Total) that would also contain a total for every column like below:
+-----------+------+-----------+--------+-----------+-----+
|salesperson|camera|large_phone|notebook|small_phone|Total|
+-----------+------+-----------+--------+-----------+-----+
| gary| 0| 0| 3| 0| 3|
| mary| 0| 0| 0| 3| 3|
| john| 3| 3| 2| 2| 10|
| Total| 3| 3| 5| 5| 16|
+-----------+------+-----------+--------+-----------+-----+
Is it possible to do this in Spark using Scala/Python (preferably Scala), and without using union if possible?
TIA

You can do something like below:
import org.apache.spark.sql.functions.{col, lit, sum}
// Every column except salesperson
val columns = df.columns.filter(_ != "salesperson").map(col)
// Use function `sum` on each column and union the result with the original DataFrame.
val withTotalAsRow = df.union(df.select(lit("Total").as("salesperson") +: columns.map(sum): _*))
// I think this column already exists in the DataFrame; if not,
// append it by adding up the value from each column
val withTotalAsColumn = withTotalAsRow.withColumn("Total", columns.reduce(_ plus _))

With Spark Scala, you can add the Total column using the following snippet. Note that sum is an aggregate function that takes a single column, so the per-row total is built with + instead:
// Assuming a Spark session is available as a variable named 'spark'
import spark.implicits._
val resultDF = df.withColumn("Total", $"camera" + $"large_phone" + $"notebook" + $"small_phone")
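For readers following along in Python, the Total row is just a column-wise sum appended as one extra record. The sketch below uses plain dictionaries in place of a DataFrame (an assumption made so it runs without Spark) and the pivoted values from the question.

```python
value_columns = ["camera", "large_phone", "notebook", "small_phone", "Total"]

pivoted = [
    {"salesperson": "gary", "camera": 0, "large_phone": 0, "notebook": 3, "small_phone": 0, "Total": 3},
    {"salesperson": "mary", "camera": 0, "large_phone": 0, "notebook": 0, "small_phone": 3, "Total": 3},
    {"salesperson": "john", "camera": 3, "large_phone": 3, "notebook": 2, "small_phone": 2, "Total": 10},
]

# Sum every numeric column over all rows and label the new row "Total".
total_row = {"salesperson": "Total"}
for name in value_columns:
    total_row[name] = sum(row[name] for row in pivoted)

with_total = pivoted + [total_row]

print(total_row)
# {'salesperson': 'Total', 'camera': 3, 'large_phone': 3, 'notebook': 5, 'small_phone': 5, 'Total': 16}
```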

Combine two rows in Pyspark if a condition is met

I have a PySpark data table that looks like the following
shouldMerge | number
true | 1
true | 1
true | 2
false | 3
false | 1
I want to combine all of the rows with shouldMerge as true and add up the numbers,
so the final output would look like
shouldMerge | number
true | 4
false | 3
false | 1
How can I select all the ones with shouldMerge == true, add up the numbers, and generate a new row in PySpark?
Edit: Alternate, slightly more complicated scenario closer to what I'm trying to solve, where we only aggregate positive numbers:
mergeId | number
1 | 1
2 | 1
1 | 2
-1 | 3
-1 | 1
Expected output:
mergeId | number
1 | 3
2 | 1
-1 | 3
-1 | 1

IIUC, you want to do a groupBy, but only on the positive mergeIds.
One way is to filter your DataFrame for the positive ids, group, aggregate, and union this back with the negative ids (similar to @shanmuga's answer).
Another way is to use when to dynamically create a grouping key. If the mergeId is positive, use the mergeId to group. Otherwise, use a monotonically_increasing_id to ensure that the row does not get aggregated.
Here is an example:
import pyspark.sql.functions as f
df.withColumn("uid", f.monotonically_increasing_id())\
    .groupBy(
        f.when(
            f.col("mergeId") > 0,
            f.col("mergeId")
        ).otherwise(f.col("uid")).alias("mergeKey"),
        f.col("mergeId")
    )\
    .agg(f.sum("number").alias("number"))\
    .drop("mergeKey")\
    .show()
#+-------+------+
#|mergeId|number|
#+-------+------+
#| -1| 1.0|
#| 1| 3.0|
#| 2| 1.0|
#| -1| 3.0|
#+-------+------+
This can easily be generalized by changing the when condition (in this case it's f.col("mergeId") > 0) to match your specific requirements.
Explanation:
First we create a temporary column uid which is a unique ID for each row. Next, we call groupBy and if the mergeId is positive use the mergeId to group. Otherwise we use the uid as the mergeKey. I also passed in the mergeId as a second group by column as a way to keep that column for the output.
To demonstrate what is going on, take a look at the intermediate result:
df.withColumn("uid", f.monotonically_increasing_id())\
    .withColumn(
        "mergeKey",
        f.when(
            f.col("mergeId") > 0,
            f.col("mergeId")
        ).otherwise(f.col("uid")).alias("mergeKey")
    )\
    .show()
#+-------+------+-----------+-----------+
#|mergeId|number| uid| mergeKey|
#+-------+------+-----------+-----------+
#| 1| 1| 0| 1|
#| 2| 1| 8589934592| 2|
#| 1| 2|17179869184| 1|
#| -1| 3|25769803776|25769803776|
#| -1| 1|25769803777|25769803777|
#+-------+------+-----------+-----------+
As you can see, the mergeKey remains the unique value for the negative mergeIds.
From this intermediate step, the desired result is just a trivial group by and sum, followed by dropping the mergeKey column.
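The grouping-key idea can be sketched in plain Python, without Spark, to check the expected result. The row index plays the role of uid here, and tagging the keys as ("merge", ...) or ("row", ...) is my own choice for the sketch so that a row index can never collide with a mergeId.

```python
rows = [(1, 1), (2, 1), (1, 2), (-1, 3), (-1, 1)]  # (mergeId, number)

totals = {}
for uid, (merge_id, number) in enumerate(rows):
    # Positive ids share one key per mergeId; every other row gets its own key.
    key = ("merge", merge_id) if merge_id > 0 else ("row", uid)
    _, running = totals.get(key, (merge_id, 0))
    totals[key] = (merge_id, running + number)

# Dictionaries keep insertion order, so the rows come out in first-seen order.
result = list(totals.values())
print(result)  # [(1, 3), (2, 1), (-1, 3), (-1, 1)]
```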

You will have to select only the rows where shouldMerge is true and aggregate them, then union the result with all the remaining rows.
import pyspark.sql.functions as functions
df = sqlContext.createDataFrame([
(True, 1),
(True, 1),
(True, 2),
(False, 3),
(False, 1),
], ("shouldMerge", "number"))
false_df = df.filter("shouldMerge = false")
true_df = df.filter("shouldMerge = true")
result = true_df.groupBy("shouldMerge")\
.agg(functions.sum("number").alias("number"))\
.unionAll(false_df)
df = sqlContext.createDataFrame([
(1, 1),
(2, 1),
(1, 2),
(-1, 3),
(-1, 1),
], ("mergeId", "number"))
merge_condition = df["mergeId"] > -1
remaining = ~merge_condition
groupby_field = "mergeId"
false_df = df.filter(remaining)
true_df = df.filter(merge_condition)
result = true_df.groupBy(groupby_field)\
    .agg(functions.sum("number").alias("number"))\
    .unionAll(false_df)
result.show()

For the first problem posted by the OP:
# Create the DataFrame
valuesCol = [(True,1),(True,1),(True,2),(False,3),(False,1)]
df = sqlContext.createDataFrame(valuesCol,['shouldMerge','number'])
df.show()
+-----------+------+
|shouldMerge|number|
+-----------+------+
| true| 1|
| true| 1|
| true| 2|
| false| 3|
| false| 1|
+-----------+------+
# Packages to be imported
from pyspark.sql.window import Window
from pyspark.sql.functions import when, col, lag
# Register the dataframe as a view
df.registerTempTable('table_view')
df=sqlContext.sql(
'select shouldMerge, number, sum(number) over (partition by shouldMerge) as sum_number from table_view'
)
df = df.withColumn('number',when(col('shouldMerge')==True,col('sum_number')).otherwise(col('number')))
df.show()
+-----------+------+----------+
|shouldMerge|number|sum_number|
+-----------+------+----------+
| true| 4| 4|
| true| 4| 4|
| true| 4| 4|
| false| 3| 4|
| false| 1| 4|
+-----------+------+----------+
df = df.drop('sum_number')
my_window = Window.partitionBy().orderBy('shouldMerge')
df = df.withColumn('shouldMerge_lag', lag(col('shouldMerge'),1).over(my_window))
df.show()
+-----------+------+---------------+
|shouldMerge|number|shouldMerge_lag|
+-----------+------+---------------+
| false| 3| null|
| false| 1| false|
| true| 4| false|
| true| 4| true|
| true| 4| true|
+-----------+------+---------------+
df = df.where(~((col('shouldMerge')==True) & (col('shouldMerge_lag')==True))).drop('shouldMerge_lag')
df.show()
+-----------+------+
|shouldMerge|number|
+-----------+------+
| false| 3|
| false| 1|
| true| 4|
+-----------+------+
For the second problem posted by the OP
# Create the DataFrame
valuesCol = [(1,2),(1,1),(2,1),(1,2),(-1,3),(-1,1)]
df = sqlContext.createDataFrame(valuesCol,['mergeId','number'])
df.show()
+-------+------+
|mergeId|number|
+-------+------+
| 1| 2|
| 1| 1|
| 2| 1|
| 1| 2|
| -1| 3|
| -1| 1|
+-------+------+
# Packages to be imported
from pyspark.sql.window import Window
from pyspark.sql.functions import when, col, lag
# Register the dataframe as a view
df.registerTempTable('table_view')
df=sqlContext.sql(
'select mergeId, number, sum(number) over (partition by mergeId) as sum_number from table_view'
)
df = df.withColumn('number',when(col('mergeId') > 0,col('sum_number')).otherwise(col('number')))
df.show()
+-------+------+----------+
|mergeId|number|sum_number|
+-------+------+----------+
| 1| 5| 5|
| 1| 5| 5|
| 1| 5| 5|
| 2| 1| 1|
| -1| 3| 4|
| -1| 1| 4|
+-------+------+----------+
df = df.drop('sum_number')
my_window = Window.partitionBy('mergeId').orderBy('mergeId')
df = df.withColumn('mergeId_lag', lag(col('mergeId'),1).over(my_window))
df.show()
+-------+------+-----------+
|mergeId|number|mergeId_lag|
+-------+------+-----------+
| 1| 5| null|
| 1| 5| 1|
| 1| 5| 1|
| 2| 1| null|
| -1| 3| null|
| -1| 1| -1|
+-------+------+-----------+
df = df.where(~((col('mergeId') > 0) & (col('mergeId_lag').isNotNull()))).drop('mergeId_lag')
df.show()
+-------+------+
|mergeId|number|
+-------+------+
| 1| 5|
| 2| 1|
| -1| 3|
| -1| 1|
+-------+------+
Documentation: lag() - Returns the value that is offset rows before the current row.
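A minimal plain-Python sketch of what lag does within one partition, and of why filtering on it keeps a single row per merged group. The lag helper below is a hypothetical list-based stand-in, not the PySpark function.

```python
def lag(values, offset=1, default=None):
    """Return, for each position, the value `offset` rows earlier, or `default`."""
    return [values[i - offset] if i >= offset else default for i in range(len(values))]


# The mergeId = 1 partition after the window sum: three identical rows.
partition = [(1, 5), (1, 5), (1, 5)]
merge_id_lag = lag([merge_id for merge_id, _ in partition])
print(merge_id_lag)  # [None, 1, 1]

# Only the first row of the partition has no previous row, so keeping the rows
# whose lag is None leaves exactly one row for the merged group.
kept = [row for row, previous in zip(partition, merge_id_lag) if previous is None]
print(kept)  # [(1, 5)]
```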

Pyspark - "Recursive" function involving last day

I'm working on a process in PySpark where I have a dataframe and I'm trying to add one more column (using the withColumn method).
The problem is that the formula is:
STATUS1 = If 'PETP-today' > 0 then 'Status1 last day' + 'PETP-today' else 0
Each result for Status1 involves status1 from the last day result.
One solution I found was to create a pandas dataframe and run the records one by one till I can calculate each, using variables. However I'll have performance issues. Can you help?
Consider the dataframe columns: Date (daily) / PETP (Float)/ STATUS1? (Float)
I really appreciate any help!

I think the key to your solution is the lag function. Try this (for simplicity, I am assuming integer data for all columns):
First, use lag to bring the previous day's status onto each row
import pyspark
from pyspark.sql import SparkSession
from pyspark import SparkContext
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql import Window
sc = SparkContext.getOrCreate()
spark = SparkSession(sc)
columns = ['date', 'petp', 'status']
data = [(0, 0, 0), (1, 1, 1), (2, 2, 2), (3,3,3), (4,4,4), (5,5,5)]
pd_data = pd.DataFrame.from_records(data=data, columns=columns)
spark_data = spark.createDataFrame(pd_data)
spark_data_with_lag = spark_data.withColumn("status_last_day", F.lag("status", 1, 0).over(Window.orderBy("date")))
spark_data_with_lag.show()
+----+----+------+---------------+
|date|petp|status|status_last_day|
+----+----+------+---------------+
|   0|   0|     0|              0|
|   1|   1|     1|              0|
|   2|   2|     2|              1|
|   3|   3|     3|              2|
|   4|   4|     4|              3|
|   5|   5|     5|              4|
+----+----+------+---------------+
Then use that data in your conditional
status2 = spark_data_with_lag.withColumn("status2", F.when(F.col("petp") > 0, F.col("petp") + F.col("status_last_day")).otherwise(0))
status2.show()
+----+----+------+---------------+-------+
|date|petp|status|status_last_day|status2|
+----+----+------+---------------+-------+
|   0|   0|     0|              0|      0|
|   1|   1|     1|              0|      1|
|   2|   2|     2|              1|      3|
|   3|   3|     3|              2|      5|
|   4|   4|     4|              3|      7|
|   5|   5|     5|              4|      9|
+----+----+------+---------------+-------+
I hope that is what you were looking for.
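Since the formula in the question refers to the previous day's computed STATUS1, it can also be read as a running sum of PETP that resets to 0 whenever PETP is not positive. The plain-Python sketch below spells out that reading so that any Spark version can be checked against it on a small sample. It assumes STATUS1 is 0 before the first day, and the sample values are made up for illustration.

```python
def status1(petp_values, initial=0.0):
    """STATUS1[t] = STATUS1[t-1] + PETP[t] if PETP[t] > 0 else 0."""
    result = []
    previous = initial  # assumed starting value before the first day
    for petp in petp_values:
        previous = previous + petp if petp > 0 else 0.0
        result.append(previous)
    return result


sample = [1.0, 2.0, 0.0, 3.0, 4.0]  # hypothetical daily PETP values, ordered by date
print(status1(sample))  # [1.0, 3.0, 0.0, 3.0, 7.0]
```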

count values in multiple columns that contain a substring based on strings of lists pyspark

I have a data frame in Pyspark like below. I want to count values in two columns based on some lists and populate new columns for each list
df.show()
+---+-------------+--------------+
| id|       device|  device_model|
+---+-------------+--------------+
|  3|      mac pro|           mac|
|  1|       iphone|       iphone5|
|  1|android phone|       android|
|  1|   windows pc|       windows|
|  1|   spy camera|    spy camera|
|  2|             |        camera|
|  2|       iphone|  apple iphone|
|  3|   spy camera|              |
|  3|         cctv|          cctv|
+---+-------------+--------------+
lists are below:
phone_list = ['iphone', 'android', 'nokia']
pc_list = ['windows', 'mac']
security_list = ['camera', 'cctv']
For each id, I want to count the values in both the device and device_model columns that contain any of the strings in a list, and pivot the counts into a new data frame with one column per list.
For example, phone_list contains the string iphone, so both the values iphone and iphone5 should be counted.
The result I want
+---+------+----+--------+
| id|phones| pc|security|
+---+------+----+--------+
| 1| 4| 2| 2|
| 2| 2|null| 1|
| 3| null| 2| 3|
+---+------+----+--------+
I have done like below
df.withColumn('cat',
F.when(df.device.isin(phone_list), 'phones').otherwise(
F.when(df.device.isin(pc_list), 'pc').otherwise(
F.when(df.device.isin(security_list), 'security')))
).groupBy('id').pivot('cat').agg(F.count('cat')).show()
With the above I can only handle the device column, and only when the string matches exactly. I can't figure out how to do this for both columns and for values that merely contain the string.
How can I achieve the result I want?

Here is a working solution. I have used a udf function to check the strings and calculate the sum. You can use built-in functions instead if possible. (Comments are provided as explanation.)
#creating dictionary for the lists with names for columns
columnLists = {'phone':phone_list, 'pc':pc_list, 'security':security_list}
#udf function for checking the strings and summing them
from pyspark.sql import functions as F
from pyspark.sql import types as t
def checkDevices(device, deviceModel, name):
    total = 0
    for x in columnLists[name]:
        if x in device:
            total += 1
        if x in deviceModel:
            total += 1
    return total
checkDevicesAndSum = F.udf(checkDevices, t.IntegerType())
#populating the sum returned from udf function to respective columns
for x in columnLists:
    df = df.withColumn(x, checkDevicesAndSum(F.col('device'), F.col('device_model'), F.lit(x)))
#finally grouping and sum
df.groupBy('id').agg(F.sum('phone').alias('phone'), F.sum('pc').alias('pc'), F.sum('security').alias('security')).show()
which should give you
+---+-----+---+--------+
| id|phone| pc|security|
+---+-----+---+--------+
| 3| 0| 2| 3|
| 1| 4| 2| 2|
| 2| 2| 0| 1|
+---+-----+---+--------+
The aggregation part can be generalized in the same way as the rest. Improvements and modifications are in your hands. :)
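The counting rule can be checked without Spark. The sketch below applies the same substring test to the sample rows from the question in plain Python; count_matches and the other variable names are hypothetical, chosen for this example only.

```python
from collections import defaultdict

column_lists = {
    "phone": ["iphone", "android", "nokia"],
    "pc": ["windows", "mac"],
    "security": ["camera", "cctv"],
}

rows = [  # (id, device, device_model), copied from the question
    (3, "mac pro", "mac"),
    (1, "iphone", "iphone5"),
    (1, "android phone", "android"),
    (1, "windows pc", "windows"),
    (1, "spy camera", "spy camera"),
    (2, "", "camera"),
    (2, "iphone", "apple iphone"),
    (3, "spy camera", ""),
    (3, "cctv", "cctv"),
]


def count_matches(device, device_model, keywords):
    """Count keyword hits, checking the two columns independently."""
    return sum((k in device) + (k in device_model) for k in keywords)


# One counter dictionary per id, with a zero entry for every list name.
counts = defaultdict(lambda: dict.fromkeys(column_lists, 0))
for row_id, device, device_model in rows:
    for name, keywords in column_lists.items():
        counts[row_id][name] += count_matches(device, device_model, keywords)

for row_id in sorted(counts):
    print(row_id, counts[row_id])
# 1 {'phone': 4, 'pc': 2, 'security': 2}
# 2 {'phone': 2, 'pc': 0, 'security': 1}
# 3 {'phone': 0, 'pc': 2, 'security': 3}
```

These totals agree with the table above. Ids with no match show 0 here, where the pivot in the question would show null.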
