I have a column where each row is a struct (StructType). I want to get the max of two values inside that struct.
I tried this:
trends_df = trends_df.withColumn(
    "importance_score",
    max(
        col("avg_total")["max"]["agg_importance"],
        col("avg_total")["min"]["agg_importance"],
        key=max_key,
    ),
)
But it throws this error:
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
(Python's built-in max compares the Column objects directly while picking a winner, and Spark refuses to evaluate a Column as a boolean.)
For now I am getting it done with a UDF:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import FloatType

# treat nulls as -inf so max() never compares against None
# (note: "x if x else ..." would wrongly map a legitimate 0 to -inf too)
max_key = lambda x: x if x is not None else float("-inf")
_get_max_udf = udf(lambda x, y: max(x, y, key=max_key), FloatType())
trends_df = trends_df.withColumn(
    "importance_score",
    _get_max_udf(
        col("avg_total")["max"]["agg_importance"],
        col("avg_total")["min"]["agg_importance"],
    ),
)
This works, but I want to know if there is a way to avoid the UDF and get it done with just Spark functions.
Edit:
This is the result of trends_df.printSchema()
root
|-- avg_total: struct (nullable = true)
| |-- max: struct (nullable = true)
| | |-- avg_percent: double (nullable = true)
| | |-- max_index: long (nullable = true)
| | |-- max_val: long (nullable = true)
| | |-- total_percent: double (nullable = true)
| | |-- total_val: long (nullable = true)
| |-- min: struct (nullable = true)
| | |-- avg_percent: double (nullable = true)
| | |-- min_index: long (nullable = true)
| | |-- min_val: long (nullable = true)
| | |-- total_percent: double (nullable = true)
| | |-- total_val: long (nullable = true)
Adding an answer from the comments to highlight it.
As answered by @smurphy, I used the greatest function:
from pyspark.sql.functions import col, greatest

trends_df = trends_df.withColumn(
    "importance_score",
    greatest(
        col("avg_total")["max"]["agg_importance"],
        col("avg_total")["min"]["agg_importance"],
    ),
)
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.greatest
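For reference, a minimal runnable sketch of how greatest handles nulls (the toy column names and values below are made up for illustration, not taken from the question):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, greatest

spark = SparkSession.builder.getOrCreate()

# greatest() returns the largest non-null value per row, and null only
# when every input is null - the same behaviour the max_key lambda
# emulated in the UDF above.
df = spark.createDataFrame(
    [(1.0, 3.0), (5.0, None), (None, None)],
    ["max_score", "min_score"],
)
df.withColumn("importance_score", greatest(col("max_score"), col("min_score"))).show()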
I have a PySpark dataframe with the following schema:
|-- suborders: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- trackingStatusHistory: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- trackingStatusUpdatedAt: string (nullable = true)
| | | | |-- trackingStatus: string (nullable = true)
What I want to do is create a new deliveredat element for each entry of the suborders array, using conditions.
I need to find the date within the trackingStatusHistory array where trackingStatusHistory.trackingStatus = 'delivered'. If this trackingStatus exists, the new deliveredat element will receive the date in trackingStatusHistory.trackingStatusUpdatedAt. If it doesn't exist, it receives null.
How can I do this using pyspark?
You can do that using the higher-order functions transform + filter on arrays (available since Spark 2.4). For each struct element of the suborders array you add a new field by filtering the sub-array trackingStatusHistory and getting the delivery date, like this:
import pyspark.sql.functions as F

df = df.withColumn(
    "suborders",
    F.expr("""
        transform(
            suborders,
            x -> struct(
                filter(x.trackingStatusHistory, y -> y.trackingStatus = 'delivered')[0].trackingStatusUpdatedAt as deliveredAt,
                x.trackingStatusHistory as trackingStatusHistory
            )
        )
    """)
)
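Once that runs, each suborders element carries a deliveredAt field; here is a quick hedged check (it just inspects the first suborder of each row, assuming the array is non-empty):
# Rows whose first suborder never reached 'delivered' show null here.
df.select(F.col("suborders")[0]["deliveredAt"].alias("first_delivered_at")).show(truncate=False)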
Suppose we have two data frames df1 and df2 with the following schema:
A
|-- B: struct (nullable = true)
| |-- b1: string (nullable = true)
| |-- b2: string (nullable = true)
| |-- b3: string (nullable = true)
| |-- C: array (nullable = true)
| | |-- D: struct (containsNull = true)
| | | |-- d1: string (nullable = true)
| | | |-- d2: string (nullable = true)
Would df1.union(df2) work for these nested data frames if you wanted to add a new record? Or would you have to flatten them first?
This should work; here is a knowledge base article by Databricks:
https://kb.databricks.com/data/append-a-row-to-rdd-or-dataframe.html
You won't need to flatten your struct fields.
PS: Please ensure your columns are in the same order in both dataframes, since union matches columns by position, not by name.
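If you would rather not depend on column order, one alternative worth mentioning (Spark 2.3+) is unionByName, which matches top-level columns by name:
# Matches top-level columns by name instead of position; the nested
# struct fields inside each column must still have identical layouts.
df3 = df1.unionByName(df2)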
I want to convert a nested JSON to TSV in a Databricks notebook using PySpark.
Below is the JSON structure; the columns can change.
{"tables":[{"name":"Result","columns":[{"name":"JobTime","type":"datetime"},{"name":"Status","type":"string"}]
,"rows":[
["2020-04-19T13:45:12.528Z","Failed"]
,["2020-04-19T14:05:40.098Z","Failed"]
,["2020-04-19T13:46:31.655Z","Failed"]
,["2020-04-19T14:01:16.275Z","Failed"],
["2020-04-19T14:03:16.073Z","Failed"],
["2020-04-19T14:01:16.672Z","Failed"],
["2020-04-19T14:02:13.958Z","Failed"],
["2020-04-19T14:04:41.099Z","Failed"],
["2020-04-19T14:04:41.16Z","Failed"],
["2020-04-19T14:05:14.462Z","Failed"]
]}
]}
I am new to Databricks. Please help.
You have two ways to deal with this problem: either do some preprocessing in Python with the json library (or equivalent), or load the string directly into PySpark and work with it, like this:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
spark = SparkSession.builder.getOrCreate()
# your json
so_json = """
{"tables":[{"name":"Result","columns":[{"name":"JobTime","type":"datetime"},{"name":"Status","type":"string"}]
,"rows":[
["2020-04-19T13:45:12.528Z","Failed"]
,["2020-04-19T14:05:40.098Z","Failed"]
,["2020-04-19T13:46:31.655Z","Failed"]
,["2020-04-19T14:01:16.275Z","Failed"],
["2020-04-19T14:03:16.073Z","Failed"],
["2020-04-19T14:01:16.672Z","Failed"],
["2020-04-19T14:02:13.958Z","Failed"],
["2020-04-19T14:04:41.099Z","Failed"],
["2020-04-19T14:04:41.16Z","Failed"],
["2020-04-19T14:05:14.462Z","Failed"]
]}
]}
"""
# load it directly using read.json(); you'll see that this becomes
# a nested ArrayType/StructType wombo combo
json_df = spark.read.json(spark.sparkContext.parallelize([so_json]))
json_df.printSchema()
root
|-- tables: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- columns: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- name: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- rows: array (nullable = true)
| | | |-- element: array (containsNull = true)
| | | | |-- element: string (containsNull = true)
# select the "rows" field of the first (and only) table and explode it
# into one DataFrame row per entry
array_df = json_df.select(f.explode(f.col('tables')['rows'][0]))
Exploding takes the rows field, which is an ArrayType, and splits it into actual rows.
Then you can subselect elements either by dot or bracket notation:
array_df.printSchema()
root
|-- col: array (nullable = true)
| |-- element: string (containsNull = true)
tabular_df = array_df.select(
array_df.col[0].alias("JobTime"),
array_df.col[1].alias("Status")
)
tabular_df.show()
+--------------------+------+
| JobTime|Status|
+--------------------+------+
|2020-04-19T13:45:...|Failed|
|2020-04-19T14:05:...|Failed|
|2020-04-19T13:46:...|Failed|
|2020-04-19T14:01:...|Failed|
|2020-04-19T14:03:...|Failed|
|2020-04-19T14:01:...|Failed|
|2020-04-19T14:02:...|Failed|
|2020-04-19T14:04:...|Failed|
|2020-04-19T14:04:...|Failed|
|2020-04-19T14:05:...|Failed|
+--------------------+------+
Finally, you want to save as CSV with a custom separator (\t). Hence:
tabular_df.write.csv("path/to/file.tsv", sep="\t")
NB: You may need to manually control for types, such as converting JobTime to TimestampType, but I'll leave that up to you.
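For instance, a minimal sketch of that conversion (assuming the ISO-8601 strings shown in the sample, which cast to timestamp without an explicit format pattern):
# Strings like "2020-04-19T13:45:12.528Z" cast directly to TimestampType.
typed_df = tabular_df.withColumn("JobTime", f.col("JobTime").cast("timestamp"))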
Hope this helps.
I need to convert a PySpark dataframe column type from array to string and also remove the square brackets. This is the schema for the dataframe; the columns that need to be processed are CurrencyCode and TicketAmount.
>>> plan_queryDF.printSchema()
root
|-- event_type: string (nullable = true)
|-- publishedDate: string (nullable = true)
|-- plannedCustomerChoiceID: string (nullable = true)
|-- assortedCustomerChoiceID: string (nullable = true)
|-- CurrencyCode: array (nullable = true)
| |-- element: string (containsNull = true)
|-- TicketAmount: array (nullable = true)
| |-- element: string (containsNull = true)
|-- currentPlan: boolean (nullable = true)
|-- originalPlan: boolean (nullable = true)
|-- globalId: string (nullable = true)
|-- PlanJsonData: string (nullable = true)
Sample data from the dataframe:
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
| event_type| publishedDate|plannedCustomerChoiceID|assortedCustomerChoiceID|CurrencyCode|TicketAmount|currentPlan|originalPlan| globalId| PlanJsonData|
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
|PlannedCustomerCh...|2016-08-23T04:46:...| 087d1ff1-5f3a-496...| 2539cc4a-37e5-4f3...| [GBP]| [0]| false| false|000576015000|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T04:30:...| 0a1af217-d1e8-4ab...| 61bc5fda-0160-484...| [CNY]| [329]| true| false|000189668017|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T05:49:...| 1028b477-f93e-47f...| c6d5b761-94f2-454...| [JPY]| [3400]| true| false|000576058003|{"httpStatus":200...|
How can I do it? Currently I am casting to string and then removing the square brackets with regexp_replace, but this approach fails when I process a huge amount of data.
Is there any other way I can do it?
This is what I want:
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
| event_type| publishedDate|plannedCustomerChoiceID|assortedCustomerChoiceID|CurrencyCode|TicketAmount|currentPlan|originalPlan| globalId| PlanJsonData|
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
|PlannedCustomerCh...|2016-08-23T04:46:...| 087d1ff1-5f3a-496...| 2539cc4a-37e5-4f3...| GBP| 0| false| false|000576015000|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T04:30:...| 0a1af217-d1e8-4ab...| 61bc5fda-0160-484...| CNY| 329| true| false|000189668017|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T05:49:...| 1028b477-f93e-47f...| c6d5b761-94f2-454...| JPY| 3400| true| false|000576058003|{"httpStatus":200...|
You can try getItem(0):
df \
.withColumn("CurrencyCode", df["CurrencyCode"].getItem(0).cast("string")) \
.withColumn("TicketAmount", df["TicketAmount"].getItem(0).cast("string"))
The final cast to string is optional here, since the array elements are already strings.
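If the arrays can ever hold more than one element, one possible alternative (not part of the answer above) is concat_ws, which joins all elements into a single bracket-free string:
from pyspark.sql.functions import concat_ws

# Joins every array element with the separator; single-element arrays
# come out as the bare element, e.g. [GBP] -> GBP.
df = df \
    .withColumn("CurrencyCode", concat_ws(",", df["CurrencyCode"])) \
    .withColumn("TicketAmount", concat_ws(",", df["TicketAmount"]))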