How to update a PySpark df on the basis of another PySpark df?

df1
+---+------+-------+-----------+-----------+-----------+
|ID |NAME  |ADDRESS|DELETE_FLAG|INSERT_DATE|UPDATE_DATE|
+---+------+-------+-----------+-----------+-----------+
|1  |sravan|delhi  |false      |25/01/2023 |25/01/2023 |
|2  |ojasvi|patna  |false      |25/01/2023 |25/01/2023 |
|3  |rohith|jaipur |false      |25/01/2023 |25/01/2023 |
+---+------+-------+-----------+-----------+-----------+
df2
+---+------+
|ID |NAME  |
+---+------+
|1  |sravan|
|2  |ojasvi|
+---+------+
Suppose I have the two PySpark dataframes above (df1 and df2).
How can I get the result df3 below, given that ID and NAME are the keys?
df3
+---+------+-------+-----------+-----------+-----------+
|ID |NAME  |ADDRESS|DELETE_FLAG|INSERT_DATE|UPDATE_DATE|
+---+------+-------+-----------+-----------+-----------+
|1  |sravan|delhi  |true       |25/01/2023 |02/02/2023 |
|2  |ojasvi|patna  |true       |25/01/2023 |02/02/2023 |
|3  |rohith|jaipur |false      |25/01/2023 |25/01/2023 |
+---+------+-------+-----------+-----------+-----------+
I am looking for a generic answer where I can supply the keys as a list or as a string.

You can achieve your desired result by doing something like this:
from pyspark.sql import functions as F

# List of keys on which the join should happen
keys = ['ID', 'NAME']
df1_cols = [f"a.{i}" for i in df1.columns]

# Use an 'inner' join, since it only keeps records present in both dataframes,
# and select only the columns from the left dataframe, which is what we want.
delete_records = df1.alias('a').join(df2.alias('b'), keys, 'inner').select(*df1_cols)

# Set the flag to true on these records, as they are deleted
delete_records = delete_records.withColumn('DELETE_FLAG', F.lit('true'))
print("Deleted records:")
delete_records.show(truncate=False)

# First remove the matching older records from df1, then union the flagged
# records back in. The orderBy on the keys is optional; it just keeps the
# rows ordered by the keys.
df1 = df1.join(delete_records, keys, 'anti').union(delete_records).orderBy(keys)
print("Updated DF1:")
df1.show(truncate=False)
Output:
Deleted records:
+---+------+-------+-----------+-----------+-----------+
|ID |NAME  |ADDRESS|DELETE_FLAG|INSERT_DATE|UPDATE_DATE|
+---+------+-------+-----------+-----------+-----------+
|1  |sravan|delhi  |true       |25/01/2023 |25/01/2023 |
|2  |ojasvi|patna  |true       |25/01/2023 |25/01/2023 |
+---+------+-------+-----------+-----------+-----------+
Updated DF1:
+---+------+-------+-----------+-----------+-----------+
|ID |NAME  |ADDRESS|DELETE_FLAG|INSERT_DATE|UPDATE_DATE|
+---+------+-------+-----------+-----------+-----------+
|1  |sravan|delhi  |true       |25/01/2023 |25/01/2023 |
|2  |ojasvi|patna  |true       |25/01/2023 |25/01/2023 |
|3  |rohith|jaipur |false      |25/01/2023 |25/01/2023 |
+---+------+-------+-----------+-----------+-----------+
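If you also want to bump UPDATE_DATE on the flagged rows, as in the question's expected df3, one hedged extension is to add this right after setting DELETE_FLAG (formatting the current date as a dd/MM/yyyy string to match the sample data; the format choice is my assumption):
from pyspark.sql import functions as F

# Assumption: UPDATE_DATE is stored as a dd/MM/yyyy string, as in the sample data.
delete_records = delete_records.withColumn(
    'UPDATE_DATE', F.date_format(F.current_date(), 'dd/MM/yyyy')
)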

Related

How to add or update a column value using joins in pyspark

I need to update a column in one dataframe, say df, based on an equality condition against another dataframe, say ds.
To reproduce the issue, you can copy-paste the code below into your notebook.
Example:
df is created as
data = [["101", "sravan", "vignan"],
["102", "ramya", "vvit"],
["103", "rohith", "klu"],
["104", "sridevi", "vignan"],
["105", "gnanesh", "iit"]]
columns = ['rollNo', 'name', 'lastName']
df = spark.createDataFrame(data=data, schema=columns, verifySchema=True)
The other dataframe, ds, is created as:
ds = spark.createDataFrame(
    [
        ("100", "2015-01-01", "2015-01-01T13:51:39.340396Z"),
        ("101", "2015-01-01", "2015-01-01T12:14:58.597216Z"),
        ("102", "2015-01-01", "2015-01-01T13:51:40.417052Z"),
        ("103", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
        ("107", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
        ("108", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
        ("109", "2015-01-01", "2015-01-01T13:51:40.519832Z"),
    ],
    ["rollNo", "creation_date", "last_update_time"]
)
Now we perform a leftanti join to create dt, which contains the rows of df that have no match in ds:
dt = df.join(ds, ["rollNo"], "leftanti")
dt.show(5,False)
df.show(5,False)
The result of the leftanti join (the dt dataframe) is:
+------+-------+--------+
|rollNo|name   |lastName|
+------+-------+--------+
|104   |sridevi|vignan  |
|105   |gnanesh|iit     |
+------+-------+--------+
The original df dataframe is:
+------+-------+--------+
|rollNo|name   |lastName|
+------+-------+--------+
|101   |sravan |vignan  |
|102   |ramya  |vvit    |
|103   |rohith |klu     |
|104   |sridevi|vignan  |
|105   |gnanesh|iit     |
+------+-------+--------+
The issue: when I try to add another column to df based on an equality condition across both dataframes, as shown below, I don't get the expected response, which is to set is_deleted to true only for the rows whose rollNo appears in dt and to false for the rest:
df.withColumn('is_deleted', when(dt.rollNo == df.rollNo, True).otherwise(False)).show(5,False)
Response received:
+------+-------+--------+----------+
|rollNo|name   |lastName|is_deleted|
+------+-------+--------+----------+
|101   |sravan |vignan  |true      |
|102   |ramya  |vvit    |true      |
|103   |rohith |klu     |true      |
|104   |sridevi|vignan  |true      |
|105   |gnanesh|iit     |true      |
+------+-------+--------+----------+
Response expected:
+------+-------+--------+----------+
|rollNo|name   |lastName|is_deleted|
+------+-------+--------+----------+
|101   |sravan |vignan  |false     |
|102   |ramya  |vvit    |false     |
|103   |rohith |klu     |false     |
|104   |sridevi|vignan  |true      |   # this should be updated to true
|105   |gnanesh|iit     |true      |   # this should be updated to true
+------+-------+--------+----------+
Use a left_outer join instead, and derive the flag from whether the join found a match. (Your original expression compares dt.rollNo with df.rollNo, but dt is derived from df, so the comparison effectively resolves to df.rollNo == df.rollNo and is always true.) Code below:
from pyspark.sql.functions import col, when

dt = (
    df.join(ds.select('rollNo', col('creation_date').alias('is_deleted')),
            ['rollNo'], 'left_outer')
      .withColumn('is_deleted', when(col('is_deleted').isNull(), True).otherwise(False))
)
dt.show(truncate=False)
+------+-------+--------+----------+
|rollNo|name   |lastName|is_deleted|
+------+-------+--------+----------+
|101   |sravan |vignan  |false     |
|102   |ramya  |vvit    |false     |
|103   |rohith |klu     |false     |
|104   |sridevi|vignan  |true      |
|105   |gnanesh|iit     |true      |
+------+-------+--------+----------+
Alternatively, we can collect the matching rollNo values into a temporary list and then use the isin method to set the flag. Something like below:
from pyspark.sql.functions import when

rolls = []
dt = df.join(ds, ["rollNo"], "leftanti")
dataCollect = dt.collect()
for row in dataCollect:
    print(row['rollNo'])
    rolls.append(row['rollNo'])
print(rolls)
dt.show(5, False)
df.show(5, False)
df.withColumn('is_deleted', when(df.rollNo.isin(rolls), True).otherwise(False)).show(5, False)
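collect() pulls the matching keys onto the driver, which is fine for a handful of rows; for larger data, a hedged sketch that keeps everything in Spark derives the flag from the leftanti result directly (flag_df and the coalesce fallback are my additions, not part of either answer):
from pyspark.sql import functions as F

# Rows of df with no match in ds, flagged as deleted
flag_df = (df.join(ds, ['rollNo'], 'leftanti')
             .select('rollNo')
             .withColumn('is_deleted', F.lit(True)))

# Join the flag back onto df; unmatched rows get a null flag, coalesced to False
result = (df.join(flag_df, ['rollNo'], 'left_outer')
            .withColumn('is_deleted', F.coalesce(F.col('is_deleted'), F.lit(False))))
result.show(truncate=False)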

Compare two dataframe in pyspark and change column value

I have two pyspark dataframes like this:
df1:
+--------+-----------+
|language|users_count|
+--------+-----------+
|    Java|      20000|
|  Python|     100000|
|   Scala|       3000|
+--------+-----------+
df2:
+------------+---+
|src_language|abb|
+------------+---+
| Java| J|
| Python| P|
| Scala| S|
+------------+---+
I want to compare these two dataframes and replace the language value in df1 with the corresponding abb from df2. So the output will be:
+--------+-----------+
|language|users_count|
+--------+-----------+
|       J|      20000|
|       P|     100000|
|       S|       3000|
+--------+-----------+
How can I achieve this?
You can easily do this with join - https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.join.html#pyspark.sql.DataFrame.join
Data Preparation
import pandas as pd
from io import StringIO
import pyspark.sql.functions as F

df1 = pd.read_csv(StringIO("""language,users_count
Java,20000
Python,100000
Scala,3000
"""), delimiter=',')

df2 = pd.read_csv(StringIO("""src_language,abb
Java,J
Python,P
Scala,S
"""), delimiter=',')

# 'sql' is assumed to be the SparkSession
sparkDF1 = sql.createDataFrame(df1)
sparkDF2 = sql.createDataFrame(df2)
sparkDF1.show(truncate=False)
+----------------------+-----------+
|language |users_count|
+----------------------+-----------+
| Java |20000 |
| Python|100000 |
| Scala |3000 |
+----------------------+-----------+
sparkDF2.show()
+------------+---+
|src_language|abb|
+------------+---+
| Java| J|
| Python| P|
| Scala| S|
+------------+---+
Join
finalDF = sparkDF1.join(
    sparkDF2,
    F.trim(sparkDF1['language']) == F.trim(sparkDF2['src_language']),
    'inner'
).select(
    sparkDF2['abb'].alias('language'),
    sparkDF1['users_count']
)
finalDF.show(truncate=False)
+--------+-----------+
|language|users_count|
+--------+-----------+
|S       |3000       |
|P       |100000     |
|J       |20000      |
+--------+-----------+
You can simply join the two dataframes and then rename the column to get the required output.
#Sample Data :
columns = ['language','users_count']
data = [("Java","20000"), ("Python","100000"), ("Scala","3000")]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)
columns1 = ['src_language','abb']
data1 = [("Java","J"), ("Python","P"), ("Scala","S")]
rdd1 = spark.sparkContext.parallelize(data1)
df1 = rdd1.toDF(columns1)
#Joining dataframes and doing required transformation
df2 = (df.join(df1, df.language == df1.src_language, "inner")
         .select("abb", "users_count")
         .withColumnRenamed("abb", "language"))
Calling show() or display() on df2 then gives the same single-letter language column as in the previous answer.
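One design note beyond both answers: an inner join drops any language in df that has no abbreviation in df1. If you need to keep such rows, a hedged sketch using a left join plus coalesce (falling back to the original name, which is my assumption) could look like this:
from pyspark.sql import functions as F

df3 = (df.join(df1, df.language == df1.src_language, "left")
         .select(F.coalesce(df1["abb"], df["language"]).alias("language"),
                 df["users_count"]))
df3.show()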

Pyspark - Subtract columns from two different dataframes

I have two dataframes below:
df1: df2:
+------------+------------+-----------+ +-----------+-------------+-----------+
| date |Advertiser |Impressions| | date |Advertiser |Impressions|
+------------+------------+-----------+ +-----------+-------------+-----------+
|2020-01-08 |b |50035 | | 2020-01-07|b |10000 |
|2020-01-08 |c |70000 | | 2020-01-07|c |25260 |
+------------+------------+-----------+ +-----------+-------------+-----------+
I would like to compute df1.Impressions - df2.Impressions and save the result to a new dataframe df3:
+------------+------------+----------------+
| date |Advertiser |diff Impressions|
+------------+------------+----------------+
|2020-01-08 |b |40035 |
|2020-01-08 |c |44740 |
+------------+------------+----------------+
You can join the two dataframes using the Advertiser column and make appropriate selections:
df3 = df1.join(df2, 'Advertiser').select(
df1.date,
'Advertiser',
(df1.Impressions - df2.Impressions).alias('diff Impressions')
)
df3.show()
+----------+----------+----------------+
|      date|Advertiser|diff Impressions|
+----------+----------+----------------+
|2020-01-08|         b|           40035|
|2020-01-08|         c|           44740|
+----------+----------+----------------+
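A caveat not covered in the answer: the join matches on Advertiser alone, which works here because each advertiser appears exactly once per dataframe. If df2 always holds the previous day's figures and advertisers can repeat, a hedged sketch that also ties the dates together (the one-day offset is my assumption) might be:
import pyspark.sql.functions as F

df3 = df1.join(
    df2,
    (df1.Advertiser == df2.Advertiser) &
    (F.to_date(df1.date) == F.date_add(F.to_date(df2.date), 1))
).select(
    df1.date,
    df1.Advertiser,
    (df1.Impressions - df2.Impressions).alias('diff Impressions')
)
df3.show()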

is there a way to get date difference over a certain column

I want to calculate, for each unique id/name, the time difference it took for the status to go from order to arrived.
The input dataframe looks like this:
+----------+---+-----+----------+
|      Date| id| name|     staus|
+----------+---+-----+----------+
|1986/10/15|  A| john|     order|
|1986/10/16|  A| john|dispatched|
|1986/10/18|  A| john|   arrived|
|1986/10/15|  B|peter|     order|
|1986/10/16|  B|peter|dispatched|
|1986/10/17|  B|peter|   arrived|
|1986/10/16|  C| raul|     order|
|1986/10/17|  C| raul|dispatched|
|1986/10/18|  C| raul|   arrived|
+----------+---+-----+----------+
The expected output dataset should look similar to this:
+---+-----+---------------------------------------+
| id| name|time_difference_from_order_to_delivered|
+---+-----+---------------------------------------+
|  A| john|                                  3days|
|  B|peter|                                  2days|
|  C| Raul|                                  2days|
+---+-----+---------------------------------------+
I am stuck on what logic to implement
You can group by and calculate the date diff using a conditional aggregation:
import pyspark.sql.functions as F
df2 = df.groupBy('id', 'name').agg(
    F.datediff(
        F.to_date(F.max(F.when(F.col('staus') == 'arrived', F.col('Date'))), 'yyyy/MM/dd'),
        F.to_date(F.min(F.when(F.col('staus') == 'order', F.col('Date'))), 'yyyy/MM/dd')
    ).alias('time_diff')
)
df2.show()
+---+-----+---------+
| id| name|time_diff|
+---+-----+---------+
|  A| john|        3|
|  C| raul|        2|
|  B|peter|        2|
+---+-----+---------+
You can also directly subtract the dates, which will return an interval type column:
import pyspark.sql.functions as F
df2 = df.groupBy('id', 'name').agg(
    (
        F.to_date(F.max(F.when(F.col('staus') == 'arrived', F.col('Date'))), 'yyyy/MM/dd') -
        F.to_date(F.min(F.when(F.col('staus') == 'order', F.col('Date'))), 'yyyy/MM/dd')
    ).alias('time_diff')
)
df2.show()
+---+-----+---------+
| id| name|time_diff|
+---+-----+---------+
|  A| john|   3 days|
|  C| raul|   2 days|
|  B|peter|   2 days|
+---+-----+---------+
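If you want the literal '3days' formatting from the expected output, a hedged follow-up to the first (integer) aggregation could append the suffix; the long column name is copied from the question and the rest is my assumption:
import pyspark.sql.functions as F

df3 = df2.withColumn(
    'time_difference_from_order_to_delivered',
    F.concat(F.col('time_diff').cast('string'), F.lit('days'))
).drop('time_diff')
df3.show(truncate=False)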
Assuming 'order' is the earliest date and 'arrived' is the last, just use aggregation and datediff():
select id, name, datediff(max(date), min(date)) as num_days
from t
group by id, name;
For more precision, you can use conditional aggregation:
select id, name,
       datediff(max(case when staus = 'arrived' then date end),
                min(case when staus = 'order' then date end)
       ) as num_days
from t
group by id, name;
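A hedged usage example of running the SQL above from PySpark, assuming the dataframe is registered as a temporary view named t and that Date is already a proper date column (otherwise parse it with to_date first):
df.createOrReplaceTempView('t')

result = spark.sql("""
    select id, name,
           datediff(max(case when staus = 'arrived' then Date end),
                    min(case when staus = 'order' then Date end)) as num_days
    from t
    group by id, name
""")
result.show()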

Filtering Based on Dates and ID's from Two Dataframes: Pyspark

I have two dataframes:
DF1:
+----------+-----------+-----------+
|        ID|Dx_Min_Date|Dx_Max_Date|
+----------+-----------+-----------+
|  30794324| 2014-04-21| 2015-07-01|
|  31234323| 2013-07-04| 2017-05-02|
+----------+-----------+-----------+
DF2:
+----------+-----------+-----------+
|        ID|  Procedure|       Date|
+----------+-----------+-----------+
|  30794324|         32| 2014-06-21|
|  30794324|         14| 2014-04-25|
|  30794324|         12| 2017-08-02|
|  54321367|         14| 2014-05-02|
+----------+-----------+-----------+
I want to filter DF2 down to the IDs that appear in DF1 and whose Date falls between that ID's Dx_Min_Date and Dx_Max_Date, resulting in:
+----------+-----------+-----------+
|        ID|  Procedure|       Date|
+----------+-----------+-----------+
|  30794324|         32| 2014-06-21|
|  30794324|         14| 2014-04-25|
+----------+-----------+-----------+
Is there a way to filter based on columns of one dataframe for another?
Use a non-equi join:
df2.alias('tmp').join(
    df1,
    (df2.ID == df1.ID) &
    (df2.Date >= df1.Dx_Min_Date) &
    (df2.Date <= df1.Dx_Max_Date)
).select('tmp.*').show()
+--------+---------+----------+
|      ID|Procedure|      Date|
+--------+---------+----------+
|30794324|       32|2014-06-21|
|30794324|       14|2014-04-25|
+--------+---------+----------+
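An alternative worth noting, not part of the original answer: a left_semi join keeps only DF2's columns by construction, so the alias and select('tmp.*') become unnecessary. A minimal hedged sketch:
df2.join(
    df1,
    (df2.ID == df1.ID) &
    (df2.Date >= df1.Dx_Min_Date) &
    (df2.Date <= df1.Dx_Max_Date),
    'left_semi'
).show()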
