I'm trying to figure out how to check whether the values of one PySpark dataframe exist in another PySpark dataframe, and if so, extract the matching value and compare it again. I was thinking of chaining multiple withColumn() calls with a when() function.
For example my two dataframes can be something like:
df1
| id | value |
| ----- | ---- |
| hello | 1111 |
| world | 2222 |
df2
| id | value |
| ------ | ---- |
| hello | 1111 |
| world | 3333 |
| people | 2222 |
The result I wish to obtain: first check whether df1.id exists in df2.id, and if it does, return the corresponding df2.value. For example, I was trying something like:
df1 = df1.withColumn("df2_value", when(df1.id == df2.id, df2.value))
So I get something like:
df1
| id | value | df2_value |
| ----- | ---- | --------- |
| hello | 1111 | 1111 |
| world | 2222 | 3333 |
So that now I can do another check between these two value columns in the df1 dataframe, and return a boolean column (1 or 0) in a new dataframe.
The result I wish to get would be something like:
df3
| id | value | df2_value | match |
| ----- | ---- | --------- | ----- |
| hello | 1111 | 1111 | 1 |
| world | 2222 | 3333 | 0 |
Left join df1 with df2 on id after prefixing all df2 columns except id with df2_*:
from pyspark.sql import functions as F
df1 = spark.createDataFrame([("hello", 1111), ("world", 2222)], ["id", "value"])
df2 = spark.createDataFrame([("hello", 1111), ("world", 3333), ("people", 2222)], ["id", "value"])
df = df1.join(
df2.select("id", *[F.col(c).alias(f"df2_{c}") for c in df2.columns if c != 'id']),
["id"],
"left"
)
Then, using functools.reduce, you can construct a boolean expression to check whether the columns match across the two dataframes:
from functools import reduce
check_expr = reduce(
lambda acc, x: acc & (F.col(x) == F.col(f"df2_{x}")),
[c for c in df1.columns if c != 'id'],
F.lit(True)
)
df.withColumn("match", check_expr.cast("int")).show()
#+-----+-----+---------+-----+
#| id|value|df2_value|match|
#+-----+-----+---------+-----+
#|hello| 1111| 1111| 1|
#|world| 2222| 3333| 0|
#+-----+-----+---------+-----+
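One caveat worth noting (my addition, not part of the answer above): if an id from df1 has no match in df2, the joined df2_* columns are null and a plain == comparison evaluates to null rather than False. A null-safe variant of the same reduce could look like this:
check_expr = reduce(
    lambda acc, x: acc & F.col(x).eqNullSafe(F.col(f"df2_{x}")),  # null-safe equality
    [c for c in df1.columns if c != 'id'],
    F.lit(True)
)
df.withColumn("match", check_expr.cast("int")).show()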
I am trying to compare the values of two columns that live in different dataframes, in order to create a new dataframe based on whether they match:
df1=
| id |
| -- |
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
df2 =
| id |
| -- |
| 2 |
| 5 |
| 1 |
So, I want to add an 'X' in the is_used field when the id from df1 exists in df2, else add 'NA', to generate a result dataframe like this:
df3 =
| id | is_used |
| -- | ------- |
| 1 | X |
| 2 | X |
| 3 | NA |
| 4 | NA |
| 5 | X |
I have tried this way, but the selection criteria places an "X" in all columns:
df3 = df3.withColumn('is_used', F.when(
condition = (F.arrays_overlap(F.array(df1.id), F.array(df2.id))) == False,
value = 'NA'
).otherwise('X'))
I would appreciate any help.
Try with fullouter join:
df3 = (
df1.join(df2.alias("df2"), df1.id == df2.id, "fullouter")
.withColumn(
"is_used",
F.when(F.col("df2.id").isNotNull(), F.lit("X")).otherwise(F.lit("NA")),
)
.drop(F.col("df2.id"))
.orderBy(F.col("id"))
)
Result:
+---+-------+
|id |is_used|
+---+-------+
|1 |X |
|2 |X |
|3 |NA |
|4 |NA |
|5 |X |
+---+-------+
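A hedged side note (my assumption, not stated in the answer above): with a full outer join, ids that exist only in df2 would also appear in the result. If the output should contain only df1's ids, the same pattern works with a left join:
df3 = (
    df1.join(df2.alias("df2"), df1.id == df2.id, "left")
    .withColumn(
        "is_used",
        F.when(F.col("df2.id").isNotNull(), F.lit("X")).otherwise(F.lit("NA")),
    )
    .drop(F.col("df2.id"))
    .orderBy(F.col("id"))
)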
Try the following code; it gives a similar result and you can make the rest of the changes:
df3 = (df1.alias("df1")
       .join(df2.alias("df2"), (df1.id == df2.id), how='left')
       .withColumn('is_true', F.when(df1.id == df2.id, F.lit("X")).otherwise(F.lit("NA")))
       .select("df1.*", "is_true"))
df3.show()
First of all, I want to thank the people who contributed their code, it was very useful to understand what was happening.
The problem was that when I tried df1.id == df2.id, Spark was treating both columns as one because they had the same name, so the comparison came out True for every row.
Just renaming the field I wanted to compare made it work for me.
Here is the code:
df2 = df2.withColumnRenamed("id", "id1")
df3 = df1.alias("df1").join(df2.alias("df2"),
(df1.id == df2.id1), "left")
df3 = df3.withColumn("is_used", F.when(df1.id == df2.id1, "X").otherwise("NA"))
df3 = df3.drop("id1")
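For reference, here is a sketch of my own (building on the earlier answers and starting from the original df2, before the rename) that avoids renaming by referring to the columns through the dataframe aliases as strings, which also sidesteps the same-name ambiguity:
df3 = (
    df1.alias("a")
    .join(df2.alias("b"), F.col("a.id") == F.col("b.id"), "left")
    .withColumn("is_used", F.when(F.col("b.id").isNotNull(), "X").otherwise("NA"))
    .select("a.*", "is_used")
)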
I'm struggling with a dataframe-related problem.
import pandas as pd

# `cursor` is a database cursor from a previously executed query
columns = [desc[0] for desc in cursor.description]
data = cursor.fetchall()
df2 = pd.DataFrame(list(data), columns=columns)
df2 is as follows:
| Col1 | Col2 |
| -------- | -------------- |
| 2145779 | 2 |
| 8059234 | 3 |
| 2145779 | 3 |
| 4265093 | 2 |
| 2145779 | 2 |
| 1728234 | 5 |
I want to make a list of the values in Col1 where the value of Col2 is 3.
You can use boolean indexing:
out = df2.loc[df2.Col2.eq(3), "Col1"].agg(list)
print(out)
Prints:
[8059234, 2145779]
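A hedged alternative (assuming Col2 really holds integers rather than strings) is plain boolean indexing followed by .tolist():
out = df2.loc[df2["Col2"] == 3, "Col1"].tolist()
print(out)  # [8059234, 2145779]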
I would like to group by customer_id so that I can collect the key/value pairs of field_name and field_value as a JSON struct. For example, I have a dataframe like this:
customerID | field_name | field_value
-------------------------------------
A | age | 20
A | sex | M
A | country | US
B | country | US
B | age | 34
c | sex | F
All columns in the DataFrame have a String data type. What I want is this:
customerID | custom_attributes
-------------------------------------
A | {'age':'20', 'sex': 'M', 'country':'US'}
B | {'age':'34', 'country':'US'}
c | {'sex':'F'}
This is what I tried:
from pyspark.sql.functions import col, collect_list, struct

test = (data
    .groupBy('customer_id')
    .agg(
        collect_list(struct(col('field_name'), col('field_value'))).alias('custom_attributes'))
)
But this only gets me as far as an array that I don't know how to flatten:
customer_id | custom_attributes
--------------------------------
A | [{'field_name':'sex', 'field_value':'M'},
| {'field_name':'age', 'field_value':'34'},
| {'field_name':'country', 'field_value':'US'}]
You need to do a pivot here:
import pyspark.sql.functions as F
df2 = (df.groupBy('customerID')
.pivot('field_name')
.agg(F.first('field_value'))
.select('customerID', F.to_json(F.struct('age', 'country', 'sex')).alias('custom_attributes'))
.orderBy('customerID')
)
df2.show(truncate=False)
+----------+-------------------------------------+
|customerID|custom_attributes |
+----------+-------------------------------------+
|A |{"age":"20","country":"US","sex":"M"}|
|B |{"age":"34","country":"US"} |
|c |{"sex":"F"} |
+----------+-------------------------------------+
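If the set of field names is not known up front (so they cannot be listed in the struct), a possible alternative, assuming Spark 2.4+, is to build a map from the collected (key, value) structs and serialize it to JSON. This is only a sketch, not part of the answer above:
import pyspark.sql.functions as F

df2 = (df.groupBy('customerID')
       .agg(F.to_json(
           F.map_from_entries(
               F.collect_list(F.struct('field_name', 'field_value'))
           )
       ).alias('custom_attributes'))
       .orderBy('customerID'))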
I have a dataframe that looks like this:
+--------------+----------+----------+
| partitionCol | orderCol | valueCol |
+--------------+----------+----------+
| A | 1 | 201 |
| A | 2 | 645 |
| A | 3 | 302 |
| B | 1 | 335 |
| B | 2 | 834 |
+--------------+----------+----------+
I want to group by partitionCol, then within each partition iterate over the rows ordered by orderCol and apply some function that calculates a new column based on valueCol and a cached value.
e.g.
def foo(col_value, cached_value):
tmp = <some value based on a condition between col_value and cached_value>
<update the cached_value using some logic>
return tmp
I understand I need to group by partitionCol and apply a UDF that will operate on each chunk separately, but I am struggling to find a good way to iterate the rows and apply the logic I described, to get a desired output of:
+--------------+----------+----------+---------------+
| partitionCol | orderCol | valueCol | calculatedCol |
+--------------+----------+----------+---------------+
| A | 1 | 201 | C1 |
| A | 2 | 645 | C1 |
| A | 3 | 302 | C2 |
| B | 1 | 335 | C1 |
| B | 2 | 834 | C2 |
+--------------+----------+----------+---------------+
I think the best way for you to do that is to apply a UDF on the whole set of data:
from pyspark.sql import functions as F

# first, you create a struct with the order col and the value col
df = df.withColumn("my_data", F.struct(F.col('orderCol'), F.col('valueCol')))
# then you create an array of that new column
df = df.groupBy("partitionCol").agg(F.collect_list('my_data').alias("my_data"))
# finally, you apply your function on that array
df = df.withColumn("calculatedCol", my_udf(F.col("my_data")))
But without knowing exactly what you want to do, that is all I can offer.
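Another option, offered as a sketch of my own rather than part of the answer above, and assuming Spark 3.0+ with pandas available on the workers, is a grouped-map pandas UDF. It lets you loop over the ordered rows of each partition while carrying a cached value; the condition and cache update inside process_group are placeholders for the real logic:
import pandas as pd

def process_group(pdf: pd.DataFrame) -> pd.DataFrame:
    # one partitionCol group at a time, processed in orderCol order
    pdf = pdf.sort_values("orderCol")
    cached_value = None
    results = []
    for value in pdf["valueCol"]:
        # placeholder condition -- replace with the real foo(value, cached_value)
        results.append("C1" if cached_value is None or value > cached_value else "C2")
        cached_value = value  # placeholder cache update
    pdf["calculatedCol"] = results
    return pdf

out_schema = "partitionCol string, orderCol int, valueCol int, calculatedCol string"
result = df.groupBy("partitionCol").applyInPandas(process_group, schema=out_schema)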
I have 2 dataframes which I need to merge based on a column (employee code). Please note that the dataframes have about 75 columns, so I am providing a sample dataset to get some suggestions/sample solutions. I am using Databricks, and the datasets are read from S3.
Following are my 2 dataframes:
DATAFRAME - 1
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | B | | | | | | | | |
|-----------------------------------------------------------------------------------|
DATAFRAME - 2
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | | | | | C | | | | |
|B10001 | | | | | | | | |T2 |
|A10001 | | | | | | | | B | |
|A10001 | | | C | | | | | | |
|C10001 | | | | | | C | | | |
|-----------------------------------------------------------------------------------|
I need to merge the 2 dataframes based on EMP_CODE, basically join dataframe1 with dataframe2 on emp_code. I am getting duplicate columns when I do a join, and I am looking for some help.
Expected final dataframe:
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | B | | C | | C | | | B | |
|B10001 | | | | | | | | |T2 |
|C10001 | | | | | | C | | | |
|-----------------------------------------------------------------------------------|
There is 1 row with emp_code A10001 in dataframe1, and 3 rows in dataframe2. All the data should be merged into one record without any duplicate columns.
Thanks much
You can use an inner join:
output = df1.join(df2,['EMP_CODE'],how='inner')
You can also apply distinct at the end to remove duplicates:
output = df1.join(df2,['EMP_CODE'],how='inner').distinct()
You can do that in Scala if both dataframes have the same columns:
output = df1.union(df2)
First you need to aggregate the individual dataframes.
from pyspark.sql import functions as F
df1 = df1.groupBy('EMP_CODE').agg(F.concat_ws(" ", F.collect_list(df1.COLUMN1)))
You have to write this for all columns and for all dataframes (a way to generate these expressions for every column at once is sketched after this answer).
Then you'll have to use the union function on all dataframes:
df1.union(df2)
and then repeat the same aggregation on that unioned dataframe.
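A sketch of that idea applied to every non-key column at once (my assumptions: EMP_CODE is the only key, the remaining columns are strings, and the union is done first so the aggregation only runs once):
from pyspark.sql import functions as F

# build one aggregation expression per non-key column
agg_exprs = [
    F.concat_ws(" ", F.collect_list(F.col(c))).alias(c)  # collect_list skips nulls
    for c in df1.columns if c != "EMP_CODE"
]
output = (df1.unionByName(df2)
          .groupBy("EMP_CODE")
          .agg(*agg_exprs))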
What you need is a union.
If both dataframes have the same number of columns and the columns that are to be "union-ed" are positionally the same (as in your example), this will work:
output = df1.union(df2).dropDuplicates()
If both dataframes have the same number of columns and the columns that need to be "union-ed" have the same name (as in your example as well), this would be better:
output = df1.unionByName(df2).dropDuplicates()