I have two pyspark dataframes like this:
df1:
+--------+-----------+
|language|users_count|
+--------+-----------+
|    Java|      20000|
|  Python|     100000|
|   Scala|       3000|
+--------+-----------+
df2:
+------------+---+
|src_language|abb|
+------------+---+
| Java| J|
| Python| P|
| Scala| S|
+------------+---+
I want to compare these two dataframes and replace the column value in df1 with abb in df2. So the output will be:
+--------+-----------+
|language|users_count|
+--------+-----------+
|       J|      20000|
|       P|     100000|
|       S|       3000|
+--------+-----------+
How can I achieve this?
You can easily do this with a join - https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.join.html#pyspark.sql.DataFrame.join
Data Preparation
import pandas as pd
from io import StringIO
import pyspark.sql.functions as F

df1 = pd.read_csv(StringIO("""language,users_count
Java,20000
Python,100000
Scala,3000
""")
                  ,delimiter=','
                  )

df2 = pd.read_csv(StringIO("""src_language,abb
Java,J
Python,P
Scala,S
""")
                  ,delimiter=','
                  )

# 'sql' is the active SparkSession
sparkDF1 = sql.createDataFrame(df1)
sparkDF2 = sql.createDataFrame(df2)
sparkDF1.show(truncate=False)
+--------+-----------+
|language|users_count|
+--------+-----------+
|Java    |20000      |
|Python  |100000     |
|Scala   |3000       |
+--------+-----------+
sparkDF2.show()
+------------+---+
|src_language|abb|
+------------+---+
| Java| J|
| Python| P|
| Scala| S|
+------------+---+
Join
finalDF = sparkDF1.join(sparkDF2
                        ,F.trim(sparkDF1['language']) == F.trim(sparkDF2['src_language'])
                        ,'inner'
                        ).select(sparkDF2['abb'].alias('language')
                                 ,sparkDF1['users_count']
                                 )
finalDF.show(truncate=False)
+--------+-----------+
|language|users_count|
+--------+-----------+
|S |3000 |
|P |100000 |
|J |20000 |
+--------+-----------+
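As a side note, when the lookup table (sparkDF2 here) is small, wrapping it in a broadcast hint avoids a shuffle; a minimal sketch of the same join with the hint added:

finalDF = sparkDF1.join(F.broadcast(sparkDF2)
                        ,F.trim(sparkDF1['language']) == F.trim(sparkDF2['src_language'])
                        ,'inner'
                        ).select(sparkDF2['abb'].alias('language')
                                 ,sparkDF1['users_count']
                                 )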
You can simply join the two dataframes and then rename the column to get the required output.
#Sample Data :
columns = ['language','users_count']
data = [("Java","20000"), ("Python","100000"), ("Scala","3000")]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)
columns1 = ['src_language','abb']
data1 = [("Java","J"), ("Python","P"), ("Scala","S")]
rdd1 = spark.sparkContext.parallelize(data1)
df1 = rdd1.toDF(columns1)
#Joining dataframes and doing required transformation
df2 = df.join(df1, df.language == df1.src_language, "inner") \
        .select("abb", "users_count") \
        .withColumnRenamed("abb", "language")
Once you perform show or display on the dataframe, you can see that the language column now holds the abbreviations (J, P, S) alongside their user counts.
df1
+---+------+-------+-----------+-----------+-----------+
|ID |NAME  |ADDRESS|DELETE_FLAG|INSERT_DATE|UPDATE_DATE|
+---+------+-------+-----------+-----------+-----------+
|1  |sravan|delhi  |false      |25/01/2023 |25/01/2023 |
|2  |ojasvi|patna  |false      |25/01/2023 |25/01/2023 |
|3  |rohith|jaipur |false      |25/01/2023 |25/01/2023 |
+---+------+-------+-----------+-----------+-----------+
df2
+---+------+
|ID |NAME  |
+---+------+
|1  |sravan|
|2  |ojasvi|
+---+------+
Suppose I have two PySpark dataframes, df1 and df2, as shown above.
How can I get the result df3 like below, given that ID and NAME are the keys?
df3
+---+------+-------+-----------+-----------+-----------+
|ID |NAME  |ADDRESS|DELETE_FLAG|INSERT_DATE|UPDATE_DATE|
+---+------+-------+-----------+-----------+-----------+
|1  |sravan|delhi  |true       |25/01/2023 |02/02/2023 |
|2  |ojasvi|patna  |true       |25/01/2023 |02/02/2023 |
|3  |rohith|jaipur |false      |25/01/2023 |25/01/2023 |
+---+------+-------+-----------+-----------+-----------+
I am looking for a more generic answer where I can state the keys in a list or store them as a string.
You can achieve your desired results by doing something like this,
from pyspark.sql import functions as F

# List of keys on which the join should happen
keys = ['ID', 'NAME']

df1_cols = [f"a.{i}" for i in df1.columns]

# Use an 'inner' join since it only returns records present in both dataframes,
# and select only the columns from the left df, as that is what we want
delete_records = df1.alias('a').join(df2.alias('b'), keys, 'inner').select(*df1_cols)

# Set all the flags to true as these records are deleted
delete_records = delete_records.withColumn('DELETE_FLAG', F.lit('true'))

print("Deleted records:")
delete_records.show(truncate=False)

# First remove the matching older records from df1 with an anti join, then union the
# updated delete records back in. The orderBy is optional; it just keeps the output
# ordered according to the keys.
df1 = df1.join(delete_records, keys, 'anti').union(delete_records).orderBy(keys)
print("Updated DF1:")
df1.show(truncate=False)
Output:
Deleted records:
+---+------+-------+-----------+-----------+-----------+
|ID |NAME |ADDRESS|DELETE_FLAG|INSERT_DATE|UPDATE_DATE|
+---+------+-------+-----------+-----------+-----------+
| 1 |sravan|delhi |true |25/01/2023 |25/01/2023 |
| 2 |ojasvi|patna |true |25/01/2023 |25/01/2023 |
+---+------+-------+-----------+-----------+-----------+
Updated DF1:
+---+------+-------+-----------+-----------+-----------+
|ID |NAME |ADDRESS|DELETE_FLAG|INSERT_DATE|UPDATE_DATE|
+---+------+-------+-----------+-----------+-----------+
| 1 |sravan|delhi |true |25/01/2023 |25/01/2023 |
| 2 |ojasvi|patna |true |25/01/2023 |25/01/2023 |
| 3 |rohith|jaipur |false |25/01/2023 |25/01/2023 |
+---+------+-------+-----------+-----------+-----------+
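If you also want UPDATE_DATE to reflect the change, as in the expected df3 from the question, you could stamp the flagged records before the union; a minimal sketch, assuming you want the current date in the same dd/MM/yyyy string format:

# Hypothetical extra step: overwrite UPDATE_DATE on the records being flagged as deleted
delete_records = delete_records.withColumn(
    'UPDATE_DATE', F.date_format(F.current_date(), 'dd/MM/yyyy')
)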
I don't understand why this isn't working in PySpark...
I'm trying to split the data into an approved DataFrame and a rejected DataFrame based on column values. So rejected looks at the language column values in approved and only returns rows where the language does not exist in the approved DataFrame's language column:
# Data
columns = ["language", "users_count"]
data = [("Java", 20000), ("Python", 100000), ("Scala", 3000), ("C++", 10000), ("C#", 32195432), ("C", 238135), ("R", 134315), ("Ruby", 235), ("C", 1000), ("R", 2000), ("Ruby", 4000)]
df = spark.createDataFrame(data, columns)
df.show()
# +--------+-----------+
# |language|users_count|
# +--------+-----------+
# | Java| 20000|
# | Python| 100000|
# | Scala| 3000|
# | C++| 10000|
# | C#| 32195432|
# | C| 238135|
# | R| 134315|
# | Ruby| 235|
# | C| 1000|
# | R| 2000|
# | Ruby| 4000|
# +--------+-----------+
# Approved
is_approved = df.users_count > 10000
df_approved = df.filter(is_approved)
df_approved.show()
# +--------+-----------+
# |language|users_count|
# +--------+-----------+
# | Java| 20000|
# | Python| 100000|
# | C#| 32195432|
# | C| 238135|
# | R| 134315|
# +--------+-----------+
# Rejected
is_not_approved = ~df.language.isin(df_approved.language)
df_rejected = df.filter(is_not_approved)
df_rejected.show()
# +--------+-----------+
# |language|users_count|
# +--------+-----------+
# +--------+-----------+
# Also tried
df.filter( ~df.language.contains(df_approved.language) ).show()
# +--------+-----------+
# |language|users_count|
# +--------+-----------+
# +--------+-----------+
So that doesn't make any sense - why is df_rejected empty?
Expected outcomes using other approaches:
SQL:
SELECT * FROM df
WHERE language NOT IN ( SELECT language FROM df_approved )
Python:
data_approved = []
for language, users_count in data:
    if users_count > 10000:
        data_approved.append((language, users_count))

data_rejected = []
for language, users_count in data:
    if language not in [row[0] for row in data_approved]:
        data_rejected.append((language, users_count))
print(data_approved)
print(data_rejected)
# [('Java', 20000), ('Python', 100000), ('C#', 32195432), ('C', 238135), ('R', 134315)]
# [('Scala', 3000), ('C++', 10000), ('Ruby', 235), ('Ruby', 4000)]
Why is PySpark not filtering as expected?
First of all, you will want to use a window to keep, for each language, only the row with the maximum users_count.
from pyspark.sql import Window, functions

columns = ["language", "users_count"]
data = [("Java", 20000), ("Python", 100000), ("Scala", 3000), ("C++", 10000), ("C#", 32195432), ("C", 238135), ("R", 134315), ("Ruby", 235), ("C", 1000), ("R", 2000), ("Ruby", 4000)]
df = spark.createDataFrame(data, columns)

# Window over each language so we can compute its maximum users_count
w = Window.partitionBy('language')

df = (df.withColumn('max_users_count',
                    functions.max('users_count')
                    .over(w))
        .where(functions.col('users_count')
               == functions.col('max_users_count'))
        .drop('max_users_count'))
df.show()
+--------+-----------+
|language|users_count|
+--------+-----------+
| C#| 32195432|
| C++| 10000|
| C| 238135|
| R| 134315|
| Scala| 3000|
| Ruby| 4000|
| Python| 100000|
| Java| 20000|
+--------+-----------+
Then you can filter based on the specified condition.
is_approved = df.users_count > 10000
df_approved = df.filter(is_approved)
df_approved.show()
+--------+-----------+
|language|users_count|
+--------+-----------+
| Java| 20000|
| Python| 100000|
| C#| 32195432|
| C| 238135|
| R| 134315|
+--------+-----------+
And then for the reverse of the condition, add the ~ symbol in the filter statement
is_not_approved = df.filter(~is_approved)
is_not_approved.show()
+--------+-----------+
|language|users_count|
+--------+-----------+
| Scala| 3000|
| C++| 10000|
| Ruby| 235|
| C| 1000|
| R| 2000|
| Ruby| 4000|
+--------+-----------+
Went the SQL route:
columns = ["language", "users_count"]
data = [("Java", 20000), ("Python", 100000), ("Scala", 3000), ("C++", 10000), ("C#", 32195432), ("C", 238135), ("R", 134315), ("Ruby", 235), ("C", 1000), ("R", 2000), ("Ruby", 4000)]
df = spark.createDataFrame(data, columns)
df_approved = df.filter(df.users_count > 10000)
df.createOrReplaceTempView("df")
df_approved.createOrReplaceTempView("df_approved")
df_not_approved = spark.sql("""
SELECT * FROM df WHERE NOT EXISTS (
SELECT 1 FROM df_approved
WHERE df.language = df_approved.language
)
""")
df_not_approved.show()
# +--------+-----------+
# |language|users_count|
# +--------+-----------+
# | C++| 10000|
# | Ruby| 235|
# | Ruby| 4000|
# | Scala| 3000|
# +--------+-----------+
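An equivalent way to express this in Spark SQL, if you prefer a join over the NOT EXISTS subquery, is a LEFT ANTI JOIN; a minimal sketch against the same temp views:

df_not_approved = spark.sql("""
    SELECT df.*
    FROM df LEFT ANTI JOIN df_approved
    ON df.language = df_approved.language
""")
df_not_approved.show()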
Try to:
df.subtract(df_approved).show()
+--------+-----------+
|language|users_count|
+--------+-----------+
| R| 2000|
| Ruby| 4000|
| Scala| 3000|
| C| 1000|
| C++| 10000|
| Ruby| 235|
+--------+-----------+
UPD:
If you want to proceed with your existing code and get exactly the expected rows, use a Python list instead of a Spark Column object: isin expects literal values (or a Python list of them), not a column taken from another DataFrame, which is why your original filter returned nothing.
import pyspark.sql.functions as f
column_list= df_approved.select(f.collect_list('language')).first()[0]
# output ['Java', 'Python', 'C#', 'C', 'R']
is_not_approved = ~df.language.isin(column_list)
df_rejected = df.filter(is_not_approved)
df_rejected.show()
+--------+-----------+
|language|users_count|
+--------+-----------+
| Scala| 3000|
| C++| 10000|
| Ruby| 235|
| Ruby| 4000|
+--------+-----------+
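Another option that avoids collecting the approved languages to the driver is a left anti join on language; a minimal sketch:

df_rejected = df.join(df_approved.select('language').distinct(),
                      on='language', how='left_anti')
df_rejected.show()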
I have two dataframes below:
df1: df2:
+------------+------------+-----------+ +-----------+-------------+-----------+
| date |Advertiser |Impressions| | date |Advertiser |Impressions|
+------------+------------+-----------+ +-----------+-------------+-----------+
|2020-01-08 |b |50035 | | 2020-01-07|b |10000 |
|2020-01-08 |c |70000 | | 2020-01-07|c |25260 |
+------------+------------+-----------+ +-----------+-------------+-----------+
I would like to do df1(Impressions) - df2(Impressions), and save it to a new dataframe df3.
+------------+------------+----------------+
| date |Advertiser |diff Impressions|
+------------+------------+----------------+
|2020-01-08 |b |40035 |
|2020-01-08 |c |44740 |
+------------+------------+----------------+
You can join the two dataframes using the Advertiser column and make appropriate selections:
df3 = df1.join(df2, 'Advertiser').select(
df1.date,
'Advertiser',
(df1.Impressions - df2.Impressions).alias('diff Impressions')
)
df3.show()
+----------+----------+----------------+
| date|Advertiser|diff Impressions|
+----------+----------+----------------+
|2020-01-08| b| 40035|
|2020-01-08| c| 44740|
+----------+----------+----------------+
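Note that the join above matches rows on Advertiser only, which is fine for this two-row example. If each dataframe held many dates per advertiser, you might also want to pin df2 to the previous day; a hedged sketch of that variant, assuming the dates are consecutive as in the sample:

from pyspark.sql import functions as F

# Hypothetical variant: additionally require df2's row to be exactly one day before df1's
df3 = df1.join(
    df2,
    (df1.Advertiser == df2.Advertiser) &
    (df1.date == F.date_add(df2.date, 1))
).select(
    df1.date,
    df1.Advertiser,
    (df1.Impressions - df2.Impressions).alias('diff Impressions')
)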
I'm trying to compare two data frames which have the same number of columns, i.e. 4 columns, with id as the key column in both data frames:
df1 = spark.read.csv("/path/to/data1.csv")
df2 = spark.read.csv("/path/to/data2.csv")
Now I want to append a new column to DF2, i.e. column_names, which is the list of the columns having different values than df1:
df2.withColumn("column_names",udf())
DF1
+---+------+-----+--------+
| id| name | sal | Address|
+---+------+-----+--------+
|  1| ABC  | 5000| US     |
|  2| DEF  | 4000| UK     |
|  3| GHI  | 3000| JPN    |
|  4| JKL  | 4500| CHN    |
+---+------+-----+--------+
DF2:
+---+------+-----+--------+
| id| name | sal | Address|
+---+------+-----+--------+
|  1| ABC  | 5000| US     |
|  2| DEF  | 4000| CAN    |
|  3| GHI  | 3500| JPN    |
|  4| JKL_M| 4800| CHN    |
+---+------+-----+--------+
Now I want DF3
DF3:
+---+------+-----+--------+------------+
| id| name | sal | Address|column_names|
+---+------+-----+--------+------------+
|  1| ABC  | 5000| US     | []         |
|  2| DEF  | 4000| CAN    | [address]  |
|  3| GHI  | 3500| JPN    | [sal]      |
|  4| JKL_M| 4800| CHN    | [name,sal] |
+---+------+-----+--------+------------+
I saw this SO question, How to compare two dataframes and print columns that are different in Scala. I tried that, however the result is different.
I'm thinking of going with a UDF, passing a row from each dataframe to the UDF, comparing column by column and returning the list of differing columns. However, for that both data frames would need to be in sorted order so that rows with the same id are sent to the UDF together, and sorting is a costly operation here. Any solution?
Assuming that we can use id to join these two datasets, I don't think there is a need for a UDF. This can be solved using an inner join together with the array and array_remove functions, among others.
First let's create the two datasets:
df1 = spark.createDataFrame([
[1, "ABC", 5000, "US"],
[2, "DEF", 4000, "UK"],
[3, "GHI", 3000, "JPN"],
[4, "JKL", 4500, "CHN"]
], ["id", "name", "sal", "Address"])
df2 = spark.createDataFrame([
[1, "ABC", 5000, "US"],
[2, "DEF", 4000, "CAN"],
[3, "GHI", 3500, "JPN"],
[4, "JKL_M", 4800, "CHN"]
], ["id", "name", "sal", "Address"])
First we do an inner join between the two datasets, then we generate the condition df1[col] != df2[col] for each column except id. When the column values aren't equal we return the column name, otherwise an empty string. The list of conditions becomes the items of an array, from which we finally remove the empty items:
from pyspark.sql.functions import col, array, when, array_remove, lit
# get conditions for all columns except id
conditions_ = [when(df1[c]!=df2[c], lit(c)).otherwise("") for c in df1.columns if c != 'id']
select_expr =[
col("id"),
*[df2[c] for c in df2.columns if c != 'id'],
array_remove(array(*conditions_), "").alias("column_names")
]
df1.join(df2, "id").select(*select_expr).show()
# +---+-----+----+-------+------------+
# | id| name| sal|Address|column_names|
# +---+-----+----+-------+------------+
# | 1| ABC|5000| US| []|
# | 3| GHI|3500| JPN| [sal]|
# | 2| DEF|4000| CAN| [Address]|
# | 4|JKL_M|4800| CHN| [name, sal]|
# +---+-----+----+-------+------------+
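One caveat: the df1[c] != df2[c] comparison treats nulls as unknown, so a null on either side never flags the column. If a null versus a value should count as a difference, Column.eqNullSafe can be used instead; a minimal sketch of just the condition list under that assumption:

# Null-safe variant: flag the column whenever the two sides are not null-safe equal
conditions_ = [when(~df1[c].eqNullSafe(df2[c]), lit(c)).otherwise("") for c in df1.columns if c != 'id']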
Here is your solution with a UDF. I have renamed the first dataframe's columns dynamically (prefixing them with x_) so that they are not ambiguous during the check. Go through the code below and let me know if there are any concerns.
>>> from pyspark.sql.functions import *
>>> df.show()
+---+----+----+-------+
| id|name| sal|Address|
+---+----+----+-------+
| 1| ABC|5000| US|
| 2| DEF|4000| UK|
| 3| GHI|3000| JPN|
| 4| JKL|4500| CHN|
+---+----+----+-------+
>>> df1.show()
+---+----+----+-------+
| id|name| sal|Address|
+---+----+----+-------+
| 1| ABC|5000| US|
| 2| DEF|4000| CAN|
| 3| GHI|3500| JPN|
| 4|JKLM|4800| CHN|
+---+----+----+-------+
>>> df2 = df.select([col(c).alias("x_"+c) for c in df.columns])
>>> df3 = df1.join(df2, col("id") == col("x_id"), "left")
# udf declaration
>>> def CheckMatch(Column,r):
...     check=''
...     ColList=Column.split(",")
...     for cc in ColList:
...         if(r[cc] != r["x_" + cc]):
...             check=check + "," + cc
...     return check.replace(',','',1).split(",")
>>> CheckMatchUDF = udf(CheckMatch)
# final columns required for the select
>>> finalCol = df1.columns
>>> finalCol.insert(len(finalCol), "column_names")
>>> df3.withColumn("column_names", CheckMatchUDF(lit(','.join(df1.columns)), struct([df3[x] for x in df3.columns]))) \
...     .select(finalCol) \
...     .show()
+---+----+----+-------+------------+
| id|name| sal|Address|column_names|
+---+----+----+-------+------------+
| 1| ABC|5000| US| []|
| 2| DEF|4000| CAN| [Address]|
| 3| GHI|3500| JPN| [sal]|
| 4|JKLM|4800| CHN| [name, sal]|
+---+----+----+-------+------------+
Python: PySpark version of my previous scala code.
import pyspark.sql.functions as f
df1 = spark.read.option("header", "true").csv("test1.csv")
df2 = spark.read.option("header", "true").csv("test2.csv")
columns = df1.columns
df3 = df1.alias("d1").join(df2.alias("d2"), f.col("d1.id") == f.col("d2.id"), "left")
for name in columns:
    df3 = df3.withColumn(name + "_temp", f.when(f.col("d1." + name) != f.col("d2." + name), f.lit(name)))
df3.withColumn("column_names", f.concat_ws(",", *map(lambda name: f.col(name + "_temp"), columns))).select("d1.*", "column_names").show()
Scala: Here is my best approach for your problem.
val df1 = spark.read.option("header", "true").csv("test1.csv")
val df2 = spark.read.option("header", "true").csv("test2.csv")
val columns = df1.columns
val df3 = df1.alias("d1").join(df2.alias("d2"), col("d1.id") === col("d2.id"), "left")
columns.foldLeft(df3) {(df, name) => df.withColumn(name + "_temp", when(col("d1." + name) =!= col("d2." + name), lit(name)))}
.withColumn("column_names", concat_ws(",", columns.map(name => col(name + "_temp")): _*))
.show(false)
First, I join the two dataframes into df3, using the columns from df1. Then I fold left over df3, adding one temp column per column name that holds the column name when the values in df1 and df2 differ for the same id, and null otherwise.
After that, concat_ws joins those temp columns; the nulls disappear and only the differing column names are left.
+---+----+----+-------+------------+
|id |name|sal |Address|column_names|
+---+----+----+-------+------------+
|1 |ABC |5000|US | |
|2 |DEF |4000|UK |Address |
|3 |GHI |3000|JPN |sal |
|4 |JKL |4500|CHN |name,sal |
+---+----+----+-------+------------+
The only difference from your expected result is that the output is a string rather than a list.
p.s. I forgot to use PySpark but this is the normal spark, sorry.
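If you do need column_names as an array rather than a comma-separated string, you can split it afterwards in the PySpark version; a minimal sketch (array_remove drops the empty entry produced for rows with no differences):

result = df3.withColumn("column_names", f.concat_ws(",", *map(lambda name: f.col(name + "_temp"), columns))) \
            .withColumn("column_names", f.array_remove(f.split("column_names", ","), "")) \
            .select("d1.*", "column_names")
result.show(truncate=False)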
You can get that query built for you in PySpark and Scala by the spark-extension package.
It provides a diff transformation that does exactly that.
from gresearch.spark.diff import *
options = DiffOptions().with_change_column('changes')
df1.diff_with_options(df2, options, 'id').show()
+----+-----------+---+---------+----------+--------+---------+------------+-------------+
|diff| changes| id|left_name|right_name|left_sal|right_sal|left_Address|right_Address|
+----+-----------+---+---------+----------+--------+---------+------------+-------------+
| N| []| 1| ABC| ABC| 5000| 5000| US| US|
| C| [Address]| 2| DEF| DEF| 4000| 4000| UK| CAN|
| C| [sal]| 3| GHI| GHI| 3000| 3500| JPN| JPN|
| C|[name, sal]| 4| JKL| JKL_M| 4500| 4800| CHN| CHN|
+----+-----------+---+---------+----------+--------+---------+------------+-------------+
While this is a simple example, diffing DataFrames can become complicated when wide schemas, insertions, deletions and null values are involved. That package is well-tested, so you don't have to worry about getting that query right yourself.
There is a wonderful package for PySpark that compares two dataframes. The name of the package is datacompy:
https://capitalone.github.io/datacompy/
example code:
import datacompy as dc
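# 'common_keys' is assumed to be a Python list of the join key column names, e.g. ['id']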
comparison = dc.SparkCompare(spark, base_df=df1, compare_df=df2, join_columns=common_keys, match_rates=True)
comparison.report()
The above code will generate a summary report, and the one below it will give you the mismatches.
comparison.rows_both_mismatch.display()
There are also more features that you can explore.
I have two dataframes:
DF1:
+----------+-----------+-----------+
|        ID|Dx_Min_Date|Dx_Max_Date|
+----------+-----------+-----------+
|  30794324| 2014-04-21| 2015-07-01|
|  31234323| 2013-07-04| 2017-05-02|
+----------+-----------+-----------+
DF2:
+----------+-----------+-----------+
|        ID|  Procedure|       Date|
+----------+-----------+-----------+
|  30794324|         32| 2014-06-21|
|  30794324|         14| 2014-04-25|
|  30794324|         12| 2017-08-02|
|  54321367|         14| 2014-05-02|
+----------+-----------+-----------+
I want to filter the dataframe DF2 based on the IDs of DF1, keeping only rows whose Date falls between the min and max dates given by the columns Dx_Min_Date and Dx_Max_Date. Resulting in:
+----------+-----------+-----------+
|        ID|  Procedure|       Date|
+----------+-----------+-----------+
|  30794324|         32| 2014-06-21|
|  30794324|         14| 2014-04-25|
+----------+-----------+-----------+
Is there a way to filter based on columns of one dataframe for another?
Use a non-equi join:
df2.alias('tmp').join(
df1,
(df2.ID == df1.ID) &
(df2.Date >= df1.Dx_Min_Date) &
(df2.Date <= df1.Dx_Max_Date)
).select('tmp.*').show()
+--------+---------+----------+
| ID|Procedure| Date|
+--------+---------+----------+
|30794324| 32|2014-06-21|
|30794324| 14|2014-04-25|
+--------+---------+----------+
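If DF1 could contain more than one date range per ID, the inner join above would duplicate matching DF2 rows; a left semi join returns each DF2 row at most once. A minimal sketch:

df2.join(
    df1,
    (df2.ID == df1.ID) &
    (df2.Date >= df1.Dx_Min_Date) &
    (df2.Date <= df1.Dx_Max_Date),
    'left_semi'
).show()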