Pyspark - Subtract columns from two different dataframes - python

I have two dataframes below:
df1:
+----------+----------+-----------+
|      date|Advertiser|Impressions|
+----------+----------+-----------+
|2020-01-08|         b|      50035|
|2020-01-08|         c|      70000|
+----------+----------+-----------+

df2:
+----------+----------+-----------+
|      date|Advertiser|Impressions|
+----------+----------+-----------+
|2020-01-07|         b|      10000|
|2020-01-07|         c|      25260|
+----------+----------+-----------+
I would like to compute df1.Impressions - df2.Impressions and save the result to a new dataframe df3:
+----------+----------+----------------+
|      date|Advertiser|diff Impressions|
+----------+----------+----------------+
|2020-01-08|         b|           40035|
|2020-01-08|         c|           44740|
+----------+----------+----------------+

You can join the two dataframes on the Advertiser column and select the appropriate columns:
df3 = df1.join(df2, 'Advertiser').select(
    df1.date,
    'Advertiser',
    (df1.Impressions - df2.Impressions).alias('diff Impressions')
)
df3.show()
+----------+----------+----------------+
| date|Advertiser|diff Impressions|
+----------+----------+----------------+
|2020-01-08| b| 40035|
|2020-01-08| c| 44740|
+----------+----------+----------------+
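For a self-contained run, a minimal sketch that rebuilds the two inputs (the values are just the sample rows from the question) and applies the same join would be:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows taken from the tables above
df1 = spark.createDataFrame(
    [("2020-01-08", "b", 50035), ("2020-01-08", "c", 70000)],
    ["date", "Advertiser", "Impressions"],
)
df2 = spark.createDataFrame(
    [("2020-01-07", "b", 10000), ("2020-01-07", "c", 25260)],
    ["date", "Advertiser", "Impressions"],
)

df3 = df1.join(df2, "Advertiser").select(
    df1.date,
    "Advertiser",
    (df1.Impressions - df2.Impressions).alias("diff Impressions"),
)
df3.show()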

Related

Finding the rows along with the row number of first dataframe not found in second dataframe using Pyspark

I need to check a large amount of data (GBs) spread across 2 CSV files. The CSV files have no headers and contain only one column, which holds a complex string mixing numbers and letters, like this:
+------------------------------+
|_c0                           |
+------------------------------+
|Hello | world | 1.3123.412 | B|
+------------------------------+
So far, I have been able to convert the files into dataframes, but I am not sure how to proceed. Is there any way to get the rows of df1, along with their row numbers, that are not found in df2?
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
file1 = 'file_path'
file2 = 'file_path'
df1 = spark.read.csv(file1)
df2 = spark.read.csv(file2)
df1.show(truncate=False)
Let's go step by step, since you are still learning.
df1
+------------------------------+
|_c0 |
+------------------------------+
|Hello | world | 1.3123.412 | B|
|Hello | world | 1.3123.412 | C|
+------------------------------+
df2
+------------------------------+
|_c0 |
+------------------------------+
|Hello | world | 1.3123.412 | D|
|Hello | world | 1.3123.412 | C|
+------------------------------+
Generate row numbers using a window function (row_number and Window need to be imported):
from pyspark.sql import Window
from pyspark.sql.functions import row_number

df1 = df1.withColumn('id', row_number().over(Window.orderBy('_c0')))
df2 = df2.withColumn('id', row_number().over(Window.orderBy('_c0')))
Use a left semi join. This join does not keep any values from the right dataframe; it only keeps the rows of the left dataframe whose values are also found in the right dataframe:
df1.join(df2, how='left_semi', on='_c0').show(truncate=False)
+------------------------------+---+
|_c0 |id |
+------------------------------+---+
|Hello | world | 1.3123.412 | C|2 |
+------------------------------+---+
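Since the question asks for the rows of df1 that are not found in df2, the usual counterpart is a left anti join, which keeps only the left-side rows with no match on the right. A minimal sketch on the same dataframes:
# Keeps only the rows of df1 whose _c0 has no match in df2 (here, the '... | B' row)
df1.join(df2, how='left_anti', on='_c0').show(truncate=False)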

Splitting string type dictionary in PySpark dataframe into individual columns [duplicate]

I have a delta table which has a column with JSON data. I do not have a schema for it and need a way to convert the JSON data into columns
|id | json_data
| 1 | {"name":"abc", "depts":["dep01", "dep02"]}
| 2 | {"name":"xyz", "depts":["dep03"],"sal":100}
| 3 | {"name":"pqr", "depts":["dep02"], "address":{"city":"SF"}}
Expected output
|id | name | depts | sal | address_city
| 1 | "abc" | ["dep01", "dep02"] | null| null
| 2 | "xyz" | ["dep03"] | 100 | null
| 3 | "pqr" | ["dep02"] | null| "SF"
Input Dataframe -
df = spark.createDataFrame(data = [(1 , """{"name":"abc", "depts":["dep01", "dep02"]}"""), (2 , """{"name":"xyz", "depts":["dep03"],"sal":100}"""), (3 , """{"name":"pqr", "depts":["dep02"], "address":{"city":"SF"}}""")], schema = ["id", "json_data"])
df.show(truncate=False)
+---+----------------------------------------------------------+
|id |json_data |
+---+----------------------------------------------------------+
|1 |{"name":"abc", "depts":["dep01", "dep02"]} |
|2 |{"name":"xyz", "depts":["dep03"],"sal":100} |
|3 |{"name":"pqr", "depts":["dep02"], "address":{"city":"SF"}}|
+---+----------------------------------------------------------+
Convert json_data column to MapType as below -
from pyspark.sql.functions import *
from pyspark.sql.types import *
df1 = df.withColumn("cols", from_json("json_data", MapType(StringType(), StringType()))).drop("json_data")
df1.show(truncate=False)
+---+-----------------------------------------------------------+
|id |cols |
+---+-----------------------------------------------------------+
|1 |{name -> abc, depts -> ["dep01","dep02"]} |
|2 |{name -> xyz, depts -> ["dep03"], sal -> 100} |
|3 |{name -> pqr, depts -> ["dep02"], address -> {"city":"SF"}}|
+---+-----------------------------------------------------------+
Now, column cols needs to be exploded as below -
df2 = df1.select("id",explode("cols").alias("col_columns", "col_rows"))
df2.show(truncate=False)
+---+-----------+-----------------+
|id |col_columns|col_rows |
+---+-----------+-----------------+
|1 |name |abc |
|1 |depts |["dep01","dep02"]|
|2 |name |xyz |
|2 |depts |["dep03"] |
|2 |sal |100 |
|3 |name |pqr |
|3 |depts |["dep02"] |
|3 |address |{"city":"SF"} |
+---+-----------+-----------------+
Once you have col_columns and col_rows as individual columns, all that is needed is to pivot col_columns and aggregate it using the first corresponding col_rows, as below -
df3 = df2.groupBy("id").pivot("col_columns").agg(first("col_rows"))
df3.show(truncate=False)
+---+-------------+-----------------+----+----+
|id |address |depts |name|sal |
+---+-------------+-----------------+----+----+
|1 |null |["dep01","dep02"]|abc |null|
|2 |null |["dep03"] |xyz |100 |
|3 |{"city":"SF"}|["dep02"] |pqr |null|
+---+-------------+-----------------+----+----+
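As a side note, if the set of keys is known up front, pivot can also be given the expected values explicitly, which saves Spark an extra pass to compute the distinct pivot values; a sketch under that assumption:
# Listing the expected keys avoids an extra job to collect distinct values of col_columns
df3 = df2.groupBy("id").pivot("col_columns", ["name", "depts", "sal", "address"]).agg(first("col_rows"))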
Finally, you again need to repeat the above steps to bring address into a structured format, as below -
df4 = df3.withColumn("address", from_json("address", MapType(StringType(), StringType())))
df4.select("id", "depts", "name", "sal",explode_outer("address").alias("key", "address_city")).drop("key").show(truncate=False)
+---+-----------------+----+----+------------+
|id |depts |name|sal |address_city|
+---+-----------------+----+----+------------+
|1 |["dep01","dep02"]|abc |null|null |
|2 |["dep03"] |xyz |100 |null |
|3 |["dep02"] |pqr |null|SF |
+---+-----------------+----+----+------------+
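Putting the steps above together, the same pipeline can be written as one chained expression (same MapType assumption, same column names):
from pyspark.sql.functions import explode, explode_outer, first, from_json
from pyspark.sql.types import MapType, StringType

result = (
    df.withColumn("cols", from_json("json_data", MapType(StringType(), StringType())))
    .select("id", explode("cols").alias("col_columns", "col_rows"))
    .groupBy("id")
    .pivot("col_columns")
    .agg(first("col_rows"))
    .withColumn("address", from_json("address", MapType(StringType(), StringType())))
    .select("id", "depts", "name", "sal", explode_outer("address").alias("key", "address_city"))
    .drop("key")
)
result.show(truncate=False)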
To solve it you can use the split function, as in the code below.
The function takes 2 parameters: the first is the column itself and the second is the pattern used to split the string into an array of elements.
More information and examples can be found here:
https://sparkbyexamples.com/pyspark/pyspark-convert-string-to-array-column/#:~:text=PySpark%20SQL%20provides%20split(),and%20converting%20it%20into%20ArrayType.
from pyspark.sql import functions as F
df.select(F.split(F.col('depts'), ','))
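For instance, under the assumption that depts is stored as a plain comma-separated string column (not the JSON column from the question), the result of split is an array column:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical comma-separated string column
tmp = spark.createDataFrame([("dep01,dep02",), ("dep03",)], ["depts"])
tmp.select(F.split(F.col("depts"), ",").alias("depts_array")).show(truncate=False)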
To parse and promote the properties from a JSON string column dynamically, without a known schema, I am afraid you cannot use PySpark; it can be done using Scala.
For example, when you have some Avro files produced by Kafka and you want to parse the Value column, which is a serialized JSON string, dynamically:
var df = spark.read.format("avro").load("abfss://abc#def.dfs.core.windows.net/xyz.avro").select("Value")
var df_parsed = spark.read.json(df.as[String])
display(df_parsed)
The key is spark.read.json(df.as[String]) in Scala. It basically:
Converts that DF (which has only one column of interest in this case; you can of course handle multiple columns of interest similarly and union whatever you want) to a Dataset[String].
Parses the JSON strings using the standard Spark JSON reader, which does not require a schema.
As far as I know, there is no equivalent method exposed to PySpark so far.

Pyspark: match columns from two different dataframes and add value

I am trying to compare the values of two columns that live in different dataframes, in order to create a new dataframe based on whether the values match:
df1=
| id |
| -- |
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
df2 =
| id |
| -- |
| 2 |
| 5 |
| 1 |
So, I want to put an 'X' in the is_used field when the id also exists in df2, and 'NA' otherwise, to generate a result dataframe like this:
df3 =
| id | is_used |
| -- | ------- |
| 1 | X |
| 2 | X |
| 3 | NA |
| 4 | NA |
| 5 | X |
I have tried the following, but the condition ends up putting an "X" in every row:
df3 = df3.withColumn('is_used', F.when(
    condition = (F.arrays_overlap(F.array(df1.id), F.array(df2.id))) == False,
    value = 'NA'
).otherwise('X'))
I would appreciate any help
Try a full outer join:
df3 = (
    df1.join(df2.alias("df2"), df1.id == df2.id, "fullouter")
    .withColumn(
        "is_used",
        F.when(F.col("df2.id").isNotNull(), F.lit("X")).otherwise(F.lit("NA")),
    )
    .drop(F.col("df2.id"))
    .orderBy(F.col("id"))
)
Result:
+---+-------+
|id |is_used|
+---+-------+
|1 |X |
|2 |X |
|3 |NA |
|4 |NA |
|5 |X |
+---+-------+
Try the following code; it gives a similar result and you can make the remaining changes yourself:
df3 = df1.alias("df1").\
    join(df2.alias("df2"), (df1.id == df2.id), how='left').\
    withColumn('is_true', F.when(df1.id == df2.id, F.lit("X")).otherwise(F.lit("NA"))).\
    select("df1.*", "is_true")
df3.show()
First of all, I want to thank the people who contributed their code; it was very useful for understanding what was happening.
The problem was that when doing df1.id == df2.id, Spark treated both columns as one because they had the same name, so the comparison was always True.
Simply renaming the field I wanted to compare made it work for me.
Here is the code:
df2 = df2.withColumnRenamed("id", "id1")
df3 = df1.alias("df1").join(df2.alias("df2"),
                            (df1.id == df2.id1), "left")
df3 = df3.withColumn("is_used", F.when(df1.id == df2.id1, "X")
                     .otherwise("NA"))
df3 = df3.drop("id1")

is there a way to get date difference over a certain column

I want to calculate, for each unique name, the time/date difference it took for the status to go from order to arrived.
Input dataframe is like this:
+----------+---+-----+----------+
|      Date| id| name|     staus|
+----------+---+-----+----------+
|1986/10/15|  A| john|     order|
|1986/10/16|  A| john|dispatched|
|1986/10/18|  A| john|   arrived|
|1986/10/15|  B|peter|     order|
|1986/10/16|  B|peter|dispatched|
|1986/10/17|  B|peter|   arrived|
|1986/10/16|  C| raul|     order|
|1986/10/17|  C| raul|dispatched|
|1986/10/18|  C| raul|   arrived|
+----------+---+-----+----------+
The expected output dataset should look similar to this:
+---+-----+---------------------------------------+
| id| name|time_difference_from_order_to_delivered|
+---+-----+---------------------------------------+
|  A| john|                                 3 days|
|  B|peter|                                 2 days|
|  C| Raul|                                 2 days|
+---+-----+---------------------------------------+
I am stuck on what logic to implement
You can group by and calculate the date diff using a conditional aggregation:
import pyspark.sql.functions as F
df2 = df.groupBy('id', 'name').agg(
    F.datediff(
        F.to_date(F.max(F.when(F.col('staus') == 'arrived', F.col('Date'))), 'yyyy/MM/dd'),
        F.to_date(F.min(F.when(F.col('staus') == 'order', F.col('Date'))), 'yyyy/MM/dd')
    ).alias('time_diff')
)
df2.show()
+---+-----+---------+
| id| name|time_diff|
+---+-----+---------+
| A| john| 3|
| C| raul| 2|
| B|peter| 2|
+---+-----+---------+
You can also directly subtract the dates, which will return an interval type column:
import pyspark.sql.functions as F
df2 = df.groupBy('id', 'name').agg(
    (
        F.to_date(F.max(F.when(F.col('staus') == 'arrived', F.col('Date'))), 'yyyy/MM/dd') -
        F.to_date(F.min(F.when(F.col('staus') == 'order', F.col('Date'))), 'yyyy/MM/dd')
    ).alias('time_diff')
)
df2.show()
+---+-----+---------+
| id| name|time_diff|
+---+-----+---------+
| A| john| 3 days|
| C| raul| 2 days|
| B|peter| 2 days|
+---+-----+---------+
Assuming the order row has the earliest date and the arrived row the latest, just use aggregation and datediff():
select id, name, datediff(max(date), min(date)) as num_days
from t
group by id, name;
For more precision, you can use conditional aggregation:
select id, name,
       datediff(max(case when status = 'arrived' then date end),
                min(case when status = 'order' then date end)
       ) as num_days
from t
group by id, name;
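If you want to run the SQL variant from PySpark, one way (a sketch, assuming df and spark from the earlier answer, i.e. a staus column and Date strings in yyyy/MM/dd format) is to register a temporary view:
df.createOrReplaceTempView("t")

spark.sql("""
    SELECT id, name,
           datediff(MAX(CASE WHEN staus = 'arrived' THEN to_date(Date, 'yyyy/MM/dd') END),
                    MIN(CASE WHEN staus = 'order'   THEN to_date(Date, 'yyyy/MM/dd') END)) AS num_days
    FROM t
    GROUP BY id, name
""").show()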

Filtering Based on Dates and ID's from Two Dataframes: Pyspark

I have two dataframes:
DF1:
+----------+-----------+-----------+
|        ID|Dx_Min_Date|Dx_Max_Date|
+----------+-----------+-----------+
|  30794324| 2014-04-21| 2015-07-01|
|  31234323| 2013-07-04| 2017-05-02|
+----------+-----------+-----------+
DF2:
+----------+-----------+-----------+
|        ID|  Procedure|       Date|
+----------+-----------+-----------+
|  30794324|         32| 2014-06-21|
|  30794324|         14| 2014-04-25|
|  30794324|         12| 2017-08-02|
|  54321367|         14| 2014-05-02|
+----------+-----------+-----------+
I want to filter the dataframe DF2 based on the IDs in DF1 and on Date falling between the min and max dates given by the columns Dx_Min_Date and Dx_Max_Date, resulting in:
+----------+-----------+-----------+
|        ID|  Procedure|       Date|
+----------+-----------+-----------+
|  30794324|         32| 2014-06-21|
|  30794324|         14| 2014-04-25|
+----------+-----------+-----------+
Is there a way to filter based on columns of one dataframe for another?
Use a non-equi join:
df2.alias('tmp').join(
    df1,
    (df2.ID == df1.ID) &
    (df2.Date >= df1.Dx_Min_Date) &
    (df2.Date <= df1.Dx_Max_Date)
).select('tmp.*').show()
+--------+---------+----------+
| ID|Procedure| Date|
+--------+---------+----------+
|30794324| 32|2014-06-21|
|30794324| 14|2014-04-25|
+--------+---------+----------+
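If DF1 is much smaller than DF2, you can additionally hint Spark to broadcast it so the range join does not shuffle the large side; a sketch of the same join with the hint:
from pyspark.sql import functions as F

# Broadcast the small dataframe; the join condition is unchanged
df2.alias('tmp').join(
    F.broadcast(df1),
    (df2.ID == df1.ID) &
    (df2.Date >= df1.Dx_Min_Date) &
    (df2.Date <= df1.Dx_Max_Date)
).select('tmp.*').show()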
