Creating a JSON struct from available rows after Group By in PySpark

I would like to group by customer_id so that I can collect the key/value pairs of field_name and field_value as a JSON struct. For example, I have a dataframe like this:
customerID | field_name | field_value
-------------------------------------
A | age | 20
A | sex | M
A | country | US
B | country | US
B | age | 34
c | sex | F
All columns in the DataFrame have a String data type. What I want is this:
customerID | custom_attributes
-------------------------------------
A | {'age':'20', 'sex': 'M', 'country':'US'}
B | {'age':'34', 'country':'US'}
c | {'sex':'F'}
This is what I tried:
from pyspark.sql.functions import col, collect_list, struct

test = (data
    .groupBy('customer_id')
    .agg(
        collect_list(struct(col('field_name'), col('field_value'))).alias('custom_attributes'))
)
But this only gets me an array that I don't know how to flatten:
customer_id | custom_attributes
--------------------------------
A | [{'field_name':'sex', 'field_value':'M'},
| {'field_name':'age', 'field_value':'34'},
| {'field_name':'country', 'field_value':'US'}]

You need to do a pivot here:
import pyspark.sql.functions as F

df2 = (df.groupBy('customerID')
    .pivot('field_name')
    .agg(F.first('field_value'))
    .select('customerID', F.to_json(F.struct('age', 'country', 'sex')).alias('custom_attributes'))
    .orderBy('customerID')
)
df2.show(truncate=False)
+----------+-------------------------------------+
|customerID|custom_attributes |
+----------+-------------------------------------+
|A |{"age":"20","country":"US","sex":"M"}|
|B |{"age":"34","country":"US"} |
|c |{"sex":"F"} |
+----------+-------------------------------------+
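If you want to avoid the pivot (for example when the set of field_name values is not known in advance), a possible alternative, sketched here assuming Spark 2.4+ for map_from_entries and the column names from the question, is to turn the collected array of structs into a map and serialize that:
import pyspark.sql.functions as F

# Sketch, assuming Spark 2.4+ and the column names from the question.
test = (data
    .groupBy('customer_id')
    .agg(F.to_json(
        F.map_from_entries(
            F.collect_list(F.struct('field_name', 'field_value'))
        )
    ).alias('custom_attributes'))
)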

Related

Check if PySpark column values exist in another dataframe's column values

I'm trying to figure out the condition to check whether the values of one PySpark dataframe exist in another PySpark dataframe, and if so, extract the value and compare again. I was thinking of doing multiple withColumn() calls with a when() function.
For example my two dataframes can be something like:
df1
| id | value |
| ----- | ---- |
| hello | 1111 |
| world | 2222 |
df2
| id | value |
| ------ | ---- |
| hello | 1111 |
| world | 3333 |
| people | 2222 |
The result I wish to obtain is to first check whether the value of df1.id exists in df2.id and, if it does, return the corresponding df2.value. For example, I was trying something like:
df1 = df1.withColumn("df2_value", when(df1.id == df2.id, df2.value))
So I get something like:
df1
| id | value | df2_value |
| ----- | ---- | --------- |
| hello | 1111 | 1111 |
| world | 2222 | 3333 |
So that now I can do another check between these two value columns in the df1 dataframe, and return a boolean column (1 or 0) in a new dataframe.
The result I wish to get would be something like:
df3
| id | value | df2_value | match |
| ----- | ---- | --------- | ----- |
| hello | 1111 | 1111 | 1 |
| world | 2222 | 3333 | 0 |
Left join df1 with df2 on id after prefixing all df2 columns except id with df2_*:
from pyspark.sql import functions as F
df1 = spark.createDataFrame([("hello", 1111), ("world", 2222)], ["id", "value"])
df2 = spark.createDataFrame([("hello", 1111), ("world", 3333), ("people", 2222)], ["id", "value"])
df = df1.join(
    df2.select("id", *[F.col(c).alias(f"df2_{c}") for c in df2.columns if c != 'id']),
    ["id"],
    "left"
)
Then, using functools.reduce, you can construct a boolean expression that checks whether the columns match across the two dataframes:
from functools import reduce

check_expr = reduce(
    lambda acc, x: acc & (F.col(x) == F.col(f"df2_{x}")),
    [c for c in df1.columns if c != 'id'],
    F.lit(True)
)
df.withColumn("match", check_expr.cast("int")).show()
#+-----+-----+---------+-----+
#| id|value|df2_value|match|
#+-----+-----+---------+-----+
#|hello| 1111| 1111| 1|
#|world| 2222| 3333| 0|
#+-----+-----+---------+-----+
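With only one value column, as in this example, the reduced expression collapses to a single comparison, so an equivalent (sketched) one-liner is:
df.withColumn("match", (F.col("value") == F.col("df2_value")).cast("int")).show()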

Pyspark: match columns from two different dataframes and add value

I am trying to compare the values of two columns that exist in different dataframes, in order to create a new dataframe based on whether the criteria match:
df1=
| id |
| -- |
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
df2 =
| id |
| -- |
| 2 |
| 5 |
| 1 |
So, I want to add an 'X' in the is_used field when the id in df1 also exists in df2, else add 'NA', to generate a result dataframe like this:
df3 =
| id | is_used |
| -- | ------- |
| 1 | X |
| 2 | X |
| 3 | NA |
| 4 | NA |
| 5 | X |
I have tried this way, but the selection criteria puts an "X" in every row:
df3 = df3.withColumn('is_used', F.when(
    condition = (F.arrays_overlap(F.array(df1.id), F.array(df2.id))) == False,
    value = 'NA'
).otherwise('X'))
I would appreciate any help
Try a full outer join:
df3 = (
    df1.join(df2.alias("df2"), df1.id == df2.id, "fullouter")
    .withColumn(
        "is_used",
        F.when(F.col("df2.id").isNotNull(), F.lit("X")).otherwise(F.lit("NA")),
    )
    .drop(F.col("df2.id"))
    .orderBy(F.col("id"))
)
Result:
+---+-------+
|id |is_used|
+---+-------+
|1 |X |
|2 |X |
|3 |NA |
|4 |NA |
|5 |X |
+---+-------+
Try the following code; it gives a similar result and you can make the rest of the changes yourself:
df3 = df1.alias("df1").\
    join(df2.alias("df2"), (df1.id == df2.id), how='left').\
    withColumn('is_true', F.when(df1.id == df2.id, F.lit("X")).otherwise(F.lit("NA"))).\
    select("df1.*", "is_true")
df3.show()
First of all, I want to thank the people who contributed their code; it was very useful for understanding what was happening.
The problem was that when comparing df1.id == df2.id, Spark resolved both columns as the same one because they shared a name, so the comparison was always True.
I just renamed the field I wanted to compare and it worked for me.
Here is the code:
df2 = df2.withColumnRenamed("id", "id1")

df3 = df1.alias("df1").join(df2.alias("df2"), (df1.id == df2.id1), "left")
df3 = df3.withColumn("is_used", F.when(df1.id == df2.id1, "X").otherwise("NA"))
df3 = df3.drop("id1")
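An alternative that avoids renaming, sketched here on the original dataframes, is to alias both sides of the join and refer to the columns through the aliases, then derive is_used from whether the right side matched:
from pyspark.sql import functions as F

# Sketch: alias both dataframes so the two id columns stay distinguishable.
df3 = (
    df1.alias("a")
    .join(df2.alias("b"), F.col("a.id") == F.col("b.id"), "left")
    .withColumn("is_used", F.when(F.col("b.id").isNotNull(), F.lit("X")).otherwise(F.lit("NA")))
    .select("a.id", "is_used")
)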

Python DataFrame - Select dataframe rows based on values in a column of same dataframe

I'm struggling with a dataframe related problem.
columns = [desc[0] for desc in cursor.description]
data = cursor.fetchall()
df2 = pd.DataFrame(list(data), columns=columns)
df2 is as follows:
| Col1 | Col2 |
| -------- | -------------- |
| 2145779 | 2 |
| 8059234 | 3 |
| 2145779 | 3 |
| 4265093 | 2 |
| 2145779 | 2 |
| 1728234 | 5 |
I want to make a list of the values in Col1 where the value of Col2 is 3.
You can use boolean indexing:
out = df2.loc[df2.Col2.eq(3), "Col1"].agg(list)
print(out)
Prints:
[8059234, 2145779]
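If Col2 came back from the database as strings rather than integers, the comparison needs to be against "3"; a hedged variant that also returns a plain Python list:
out = df2.loc[df2["Col2"].astype(str).eq("3"), "Col1"].tolist()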

vlookup on text field using pandas

I need VLOOKUP-like functionality in pandas.
DataFrame 2: (FEED_NAME has no duplicate rows)
+-----------+--------------------+---------------------+
| FEED_NAME | Name | Source |
+-----------+--------------------+---------------------+
| DMSN | DMSN_YYYYMMDD.txt | Main hub |
| PCSUS | PCSUS_YYYYMMDD.txt | Basement |
| DAMJ | DAMJ_YYYYMMDD.txt | Eiffel Tower router |
+-----------+--------------------+---------------------+
DataFrame 1:
+-------------+
| SYSTEM_NAME |
+-------------+
| DMSN |
| PCSUS |
| DAMJ |
| : |
| : |
+-------------+
DataFrame 1 contains a lot more rows. It is actually a column in a much larger table. I need to merge df1 with df2 to make it look like:
+-------------+--------------------+---------------------+
| SYSTEM_NAME | Name | Source |
+-------------+--------------------+---------------------+
| DMSN | DMSN_YYYYMMDD.txt | Main Hub |
| PCSUS | PCSUS_YYYYMMDD.txt | Basement |
| DAMJ | DAMJ_YYYYMMDD.txt | Eiffel Tower router |
| : | | |
| : | | |
+-------------+--------------------+---------------------+
In Excel I would just have done VLOOKUP(,,1,TRUE) and then copied it across all cells.
I have tried various combinations with merge and join, but I keep getting KeyError: 'SYSTEM_NAME'.
Code:
_df1 = df1[["SYSTEM_NAME"]]
_df2 = df2[['FEED_NAME','Name','Source']]
_df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"})
_df3 = pd.merge(_df1,_df2,how='left',on='SYSTEM_NAME')
_df3.head()
You missed the inplace=True argument in the line _df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"}), so the column names of _df2 haven't changed. Try this instead:
_df1 = df1[["SYSTEM_NAME"]]
_df2 = df2[['FEED_NAME','Name','Source']]
_df2.rename(columns = {'FEED_NAME':"SYSTEM_NAME"}, inplace=True)
_df3 = pd.merge(_df1,_df2,how='left',on='SYSTEM_NAME')
_df3.head()
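Since _df2 is itself a slice of df2, rename(..., inplace=True) can also trigger a SettingWithCopyWarning; an alternative sketch that assigns the renamed frame back instead of mutating it:
_df1 = df1[["SYSTEM_NAME"]]
_df2 = df2[['FEED_NAME', 'Name', 'Source']].rename(columns={'FEED_NAME': 'SYSTEM_NAME'})
_df3 = pd.merge(_df1, _df2, how='left', on='SYSTEM_NAME')
_df3.head()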

Merge two spark dataframes based on a column

I have 2 dataframes which I need to merge based on a column (Employee code). Please note that the dataframes have about 75 columns, so I am providing a sample dataset to get some suggestions/sample solutions. I am using Databricks, and the datasets are read from S3.
Following are my 2 dataframes:
DATAFRAME - 1
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | B | | | | | | | | |
|-----------------------------------------------------------------------------------|
DATAFRAME - 2
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | | | | | C | | | | |
|B10001 | | | | | | | | |T2 |
|A10001 | | | | | | | | B | |
|A10001 | | | C | | | | | | |
|C10001 | | | | | | C | | | |
|-----------------------------------------------------------------------------------|
I need to merge the 2 dataframes based on EMP_CODE, basically joining dataframe1 with dataframe2 on emp_code. I am getting duplicate columns when I do a join, and I am looking for some help.
Expected final dataframe:
|-----------------------------------------------------------------------------------|
|EMP_CODE |COLUMN1|COLUMN2|COLUMN3|COLUMN4|COLUMN5|COLUMN6|COLUMN7|COLUMN8|COLUMN9|
|-----------------------------------------------------------------------------------|
|A10001 | B | | C | | C | | | B | |
|B10001 | | | | | | | | |T2 |
|C10001 | | | | | | C | | | |
|-----------------------------------------------------------------------------------|
There are 3 rows with emp_code A10001 in dataframe2 and 1 row in dataframe1. All data should be merged as one record without any duplicate columns.
Thanks much
You can use an inner join:
output = df1.join(df2, ['EMP_CODE'], how='inner')
You can also apply distinct at the end to remove duplicates:
output = df1.join(df2, ['EMP_CODE'], how='inner').distinct()
You can do that in Scala, if both dataframes have the same columns, with:
output = df1.union(df2)
First you need to aggregate the individual dataframes.
from pyspark.sql import functions as F
df1 = df1.groupBy('EMP_CODE').agg(F.concat_ws(" ", F.collect_list(df1.COLUMN1)))
You have to write this for all columns and for all dataframes.
Then you'll have to use the union function on all the dataframes:
df1.union(df2)
and then repeat the same aggregation on the union dataframe.
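A sketch of that per-column aggregation, assuming every non-key column should be collapsed the same way (the alias keeps the original column names so the union lines up):
from pyspark.sql import functions as F

def collapse(df):
    # One concat_ws/collect_list aggregation per non-key column, keeping the original names.
    aggs = [F.concat_ws(" ", F.collect_list(F.col(c))).alias(c)
            for c in df.columns if c != "EMP_CODE"]
    return df.groupBy("EMP_CODE").agg(*aggs)

# Aggregate each dataframe, union them, then aggregate the union again.
output = collapse(collapse(df1).unionByName(collapse(df2)))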
What you need is a union.
If both dataframes have the same number of columns and the columns that are to be "union-ed" are positionally the same (as in your example), this will work:
output = df1.union(df2).dropDuplicates()
If both dataframes have the same number of columns and the columns that need to be "union-ed" have the same name (as in your example as well), this would be better:
output = df1.unionByName(df2).dropDuplicates()
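dropDuplicates only removes fully identical rows; if you also need one merged row per EMP_CODE, as in the expected output, a possible follow-up, sketched here and assuming the blank cells are actually nulls rather than empty strings, is to group the union and keep the first non-null value of each column:
from pyspark.sql import functions as F

merged = (
    df1.unionByName(df2)
    .groupBy("EMP_CODE")
    .agg(*[F.first(c, ignorenulls=True).alias(c)
           for c in df1.columns if c != "EMP_CODE"])
)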
