Comparing two values in a structfield of a column in pyspark - python

I have a column where each row is a struct. I want to get the max of two values inside the struct.
I tried this:
trends_df = trends_df.withColumn("importance_score", max(col("avg_total")["max"]["agg_importance"], col("avg_total")["min"]["agg_importance"], key=max_key))
But it throws this error
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
I am now getting it done with a UDF:
from pyspark.sql.functions import col, udf
from pyspark.sql.types import FloatType

max_key = lambda x: x if x else float("-inf")
_get_max_udf = udf(lambda x, y: max(x, y, key=max_key), FloatType())
trends_df = trends_df.withColumn("importance_score", _get_max_udf(col("avg_total")["max"]["agg_importance"], col("avg_total")["min"]["agg_importance"]))
This works, but I want to know if there is a way I can avoid the UDF and get it done with just Spark functions.
Edit:
This is the result of trends_df.printSchema()
root
|-- avg_total: struct (nullable = true)
| |-- max: struct (nullable = true)
| | |-- avg_percent: double (nullable = true)
| | |-- max_index: long (nullable = true)
| | |-- max_val: long (nullable = true)
| | |-- total_percent: double (nullable = true)
| | |-- total_val: long (nullable = true)
| |-- min: struct (nullable = true)
| | |-- avg_percent: double (nullable = true)
| | |-- min_index: long (nullable = true)
| | |-- min_val: long (nullable = true)
| | |-- total_percent: double (nullable = true)
| | |-- total_val: long (nullable = true)

Adding an answer from the comments to highlight it.
As answered by @smurphy, I used the greatest function:
trends_df = trends_df.withColumn("importance_score", greatest(col("avg_total")["max"]["agg_importance"], col("avg_total")["min"]["agg_importance"]))
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.greatest
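For reference, greatest skips null values, which covers the case the max_key lambda was handling: a null on one side is ignored, and the result is only null when every input is null. A quick self-contained illustration (the flat column names here are made up for the toy frame, not the nested schema above):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, greatest

spark = SparkSession.builder.getOrCreate()

# toy frame standing in for the two nested struct fields above
df = spark.createDataFrame(
    [(1.0, 2.0), (3.0, None), (None, None)],
    ["max_score", "min_score"],
)

# greatest() ignores nulls and only returns null when all inputs are null
df.withColumn("importance_score", greatest(col("max_score"), col("min_score"))).show()
# +---------+---------+----------------+
# |max_score|min_score|importance_score|
# +---------+---------+----------------+
# |      1.0|      2.0|             2.0|
# |      3.0|     null|             3.0|
# |     null|     null|            null|
# +---------+---------+----------------+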

Related

Pyspark Sedona: How to unnest column values to convert string to geometry?

I want to convert a column from string to polygon with Sedona.
I am not quite sure whether the problem is one nested StructType too many, whether another Sedona function (other than ST_GeomFromWKB) is necessary, or whether I have to join the columns "geometry_type" and "geometry_polygon" first. Has anyone run into and solved this before?
My Dataframe looks like this:
root
|-- geo_name: string (nullable = true)
|-- geo_latitude: double (nullable = true)
|-- geo_longitude: double (nullable = true)
|-- geo_bst: integer (nullable = true)
|-- geo_bvw: integer (nullable = true)
|-- geometry_type: string (nullable = true)
|-- geometry_polygon: string (nullable = true)
The columns with geo information look like this:
+--------------------+
| geometry_polygon|
+--------------------+
|[[[8.4937, 49.489...|
|[[[5.0946, 51.723...|
|[[[-8.5776, 43.54...|
|[[[-8.5762, 43.55...|
|[[[6.0684, 50.766...|
+--------------------+
+-------------+
|geometry_type|
+-------------+
| Polygon|
| Polygon|
| Polygon|
| Polygon|
| Polygon|
+-------------+
I tried:
station_groups_gdf.createOrReplaceTempView("station_gdf")
spatial_station_groups_gdf = spark_sedona.sql("SELECT *, ST_GeomFromWKB(station_gdf.geometry_polygon) AS geometry_shape FROM station_gdf")
Error Message is:
ERROR FormatUtils: [Sedona] Invalid hex digit: '['
ERROR Executor: Exception in task 0.0 in stage 57.0 (TID 53)
java.lang.IllegalArgumentException: Invalid hex digit: '['
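One direction hinted at in the question itself (joining geometry_type and geometry_polygon and using a constructor other than ST_GeomFromWKB, which expects WKB hex rather than a coordinate-array string) would be to assemble a GeoJSON string and parse it with Sedona's ST_GeomFromGeoJSON. This is a rough sketch only; whether it parses depends on the exact contents of geometry_polygon and on the Sedona version (older releases expect a full GeoJSON Feature rather than a bare geometry object):
station_groups_gdf.createOrReplaceTempView("station_gdf")
spatial_station_groups_gdf = spark_sedona.sql("""
    SELECT *,
           ST_GeomFromGeoJSON(
               concat('{"type": "', geometry_type, '", "coordinates": ', geometry_polygon, '}')
           ) AS geometry_shape
    FROM station_gdf
""")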

Flatten a nested array of array & structs in Pyspark

I have a schema of this form from a json file:
root
|-- fruit_id: string (nullable = true)
|-- fruit_type: array (nullable = true)
| |-- name: string (nullable = true)
| |-- info: struct (nullable = true)
| |-- fruit_quality: array (nullable = true)
| | |-- quality: string (nullable = true)
| |-- likes: string (containsNull = true)
| |-- finance: struct (nullable = true)
| | |-- last_year_price: string (nullable = true)
| | |-- current_price: string (nullable = true)
| |-- shops: struct (nullable = true)
| | |-- shop1: string (nullable = true)
| | |-- shop2: string (nullable = true)
|-- season: string (nullable = true)
How can I get it into this form?
root
|-- fruit_id: string (nullable = true)
|-- fruit_type_name: string (nullable = true)
|-- fruit_type_info_fruit_quality_quality: string (nullable = true)
|-- fruit_type_info_likes: string (nullable = true)
|-- fruit_type_finance_last_year_price: string (nullable = true)
|-- fruit_type_finance_current_price: string (nullable = true)
|-- fruit_type_shops_shop1: string (nullable = true)
|-- fruit_type_shops_shop2: string (nullable = true)
|-- season: string (nullable = true)
This is for the case of fruits. How would I flatten it in a similar way if I receive a file with info on vegetables?
I am facing an issue while flattening the array part. I am able to flatten structs inside structs; I followed this: link
I also added this piece of code to the code from the above link, to see if this approach would work:
import pyspark.sql.functions as F
array_cols = [c[0] for c in df.dtypes if c[1][:6] == 'array']
df = df.select(
    [F.col(nc + '.' + c).alias(nc + '_' + c)
     for nc in array_cols
     for c in df.select(nc + '.*').columns])
But it's not working.
I then checked this link as well: link
But here the issue is that if I want to flatten the json file of fruits, it is possible; but if I then send a json file of vegetables with a similar schema, I'll have to redefine the code.
Another approach I went for was converting the array to a struct, and then I could flatten the nested structs, but that wasn't helpful.
Lastly, I checked this link as well: link
But this approach threw an error, saying flattening is not possible, since I have an array of structs & not an array of arrays.
So how can I solve this?
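For what it's worth, one schema-agnostic pattern that produces exactly this kind of prefixed flat schema is to keep exploding arrays and expanding structs until no complex columns remain. A rough sketch, not tied to the fruit schema (flatten and sep are names I made up):
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, StructType

def flatten(df, sep="_"):
    """Explode array columns and expand struct columns until the schema is flat."""
    while True:
        # find the first remaining complex column, if any
        complex_field = next(
            ((f.name, f.dataType) for f in df.schema.fields
             if isinstance(f.dataType, (ArrayType, StructType))),
            None,
        )
        if complex_field is None:
            return df
        name, dtype = complex_field
        if isinstance(dtype, StructType):
            # expand struct fields into prefixed top-level columns
            expanded = [F.col(name + "." + f.name).alias(name + sep + f.name)
                        for f in dtype.fields]
            df = df.select("*", *expanded).drop(name)
        else:
            # ArrayType: turn each array element into its own row
            df = df.withColumn(name, F.explode_outer(name))

flat_df = flatten(df)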

Union for Nested Spark Data Frames

Suppose we have two data frames df1 and df2 with the following schema:
A
|-- B: struct (nullable = true)
| |-- b1: string (nullable = true)
| |-- b2: string (nullable = true)
| |-- b3: string (nullable = true)
| |-- C: array (nullable = true)
| | |-- D: struct (containsNull = true)
| | | |-- d1: string (nullable = true)
| | | |-- d2: string (nullable = true)
Would df1.union(df2) work for these nested data frames if you wanted to add a new record, or would you have to flatten them first?
This should work; here is a knowledge base article by Databricks:
https://kb.databricks.com/data/append-a-row-to-rdd-or-dataframe.html
and you won't need to flatten your struct fields.
PS: Please ensure your columns are in the same order in both dataframes.
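A small sketch of that point, assuming df1 and df2 share the schema above: union matches columns by position, so the order matters, while unionByName (available since Spark 2.3) matches them by name instead.
# positional: both frames must list their columns in the same order
df3 = df1.union(df2)

# by name: avoids the column-ordering caveat
df4 = df1.unionByName(df2)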

How to access struct elements inside a pyspark dataframe?

I have the following schema for a pyspark dataframe
root
|-- maindata: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- label: string (nullable = true)
| | | |-- value: string (nullable = true)
| | | |-- unit: string (nullable = true)
| | | |-- dateTime: string (nullable = true)
Here is some data for a particular row, which I obtained with df.select(F.col("maindata")).show(1, False):
|[[[a1, 43.24, km/h, 2019-04-06T13:02:08.020], [TripCount, 135, , 2019-04-06T13:02:08.790],["t2", 0, , 2019-04-06T13:02:08.040], [t4, 0, , 2019-04-06T13:02:08.050], [t09, 0, , 2019-04-06T13:02:08.050], [t3, 1, , 2019-04-06T13:02:08.050], [t7, 0, , 2019-04-06T13:02:08.050],[TripCount, ,136, 2019-04-06T13:02:08.790]]
I want to access the TripCount value inside this, e.g. TripCount -> 136, 135, etc. What is the best way to access this data? TripCount is present multiple times.
Also, is there any way to access, say, only the label data, like maindata.label?
I would suggest applying explode multiple times to convert the array elements into individual rows, and then either converting the struct into individual columns, or working with the nested elements using dot syntax. For example:
from pyspark.sql.functions import col, explode
df=spark.createDataFrame([[[[('k1','v1', 'v2')]]]], ['d'])
df2 = df.select(explode(col('d')).alias('d')).select(explode(col('d')).alias('data'))
>>> df2.printSchema()
root
|-- data: struct (nullable = true)
| |-- _1: string (nullable = true)
| |-- _2: string (nullable = true)
| |-- _3: string (nullable = true)
>>> df2.filter(col("data._1") == "k1").show()
+------------+
| data|
+------------+
|[k1, v1, v2]|
+------------+
or you can extract members of the struct as individual columns:
from pyspark.sql.functions import col, explode
df = spark.createDataFrame([[[[('k1','v1', 'v2')]]]], ['d'])
df2 = df.select(explode(col('d')).alias('d')).select(explode(col('d')).alias('d')).select("d.*").drop("d")
>>> df2.printSchema()
root
|-- _1: string (nullable = true)
|-- _2: string (nullable = true)
|-- _3: string (nullable = true)
>>> df2.filter(col("_1") == "k1").show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| k1| v1| v2|
+---+---+---+
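Applied to the maindata schema from the question, the same pattern looks roughly like this (a sketch; df is assumed to be the frame with the schema shown above):
from pyspark.sql.functions import col, explode

# maindata is an array of arrays of structs, so explode twice
exploded = (df
    .select(explode(col("maindata")).alias("inner"))
    .select(explode(col("inner")).alias("data")))

# pick out only the TripCount entries...
exploded.filter(col("data.label") == "TripCount").select("data.value").show()

# ...or just the label field across all entries
exploded.select("data.label").show()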

nested json to tsv in databricks pyspark

I want to convert a nested JSON to TSV in a Databricks notebook using pyspark.
Below is the JSON structure; the columns can change.
{"tables":[{"name":"Result","columns":[{"name":"JobTime","type":"datetime"},{"name":"Status","type":"string"}]
,"rows":[
["2020-04-19T13:45:12.528Z","Failed"]
,["2020-04-19T14:05:40.098Z","Failed"]
,["2020-04-19T13:46:31.655Z","Failed"]
,["2020-04-19T14:01:16.275Z","Failed"],
["2020-04-19T14:03:16.073Z","Failed"],
["2020-04-19T14:01:16.672Z","Failed"],
["2020-04-19T14:02:13.958Z","Failed"],
["2020-04-19T14:04:41.099Z","Failed"],
["2020-04-19T14:04:41.16Z","Failed"],
["2020-04-19T14:05:14.462Z","Failed"]
]}
]}
I am new to Databricks. Please help.
You have two ways to deal with this problem: either do some preprocessing in Python with the json library (or equivalent), or load it directly into pyspark and work with it, for example:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
spark = SparkSession.builder.getOrCreate()
# your json
so_json = """
{"tables":[{"name":"Result","columns":[{"name":"JobTime","type":"datetime"},{"name":"Status","type":"string"}]
,"rows":[
["2020-04-19T13:45:12.528Z","Failed"]
,["2020-04-19T14:05:40.098Z","Failed"]
,["2020-04-19T13:46:31.655Z","Failed"]
,["2020-04-19T14:01:16.275Z","Failed"],
["2020-04-19T14:03:16.073Z","Failed"],
["2020-04-19T14:01:16.672Z","Failed"],
["2020-04-19T14:02:13.958Z","Failed"],
["2020-04-19T14:04:41.099Z","Failed"],
["2020-04-19T14:04:41.16Z","Failed"],
["2020-04-19T14:05:14.462Z","Failed"]
]}
]}
"""
# load in directly using read.json(), you'll see that this becomes
# a nested ArrayType/StructType wombo combo
json_df = spark.read.json(spark.sparkContext.parallelize([so_json]))
json_df.printSchema()
root
|-- tables: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- columns: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- name: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- rows: array (nullable = true)
| | | |-- element: array (containsNull = true)
| | | | |-- element: string (containsNull = true)
# select nested columns "tables" and "rows" and explode
array_df = json_df.select(f.explode(f.col('tables')['rows'][0]))
Exploding takes rows, which is an ArrayType, and splits it into actual rows.
Then you can sub-select either by dot or index notation:
array_df.printSchema()
root
|-- col: array (nullable = true)
| |-- element: string (containsNull = true)
tabular_df = array_df.select(
    array_df.col[0].alias("JobTime"),
    array_df.col[1].alias("Status")
)
tabular_df.show()
+--------------------+------+
| JobTime|Status|
+--------------------+------+
|2020-04-19T13:45:...|Failed|
|2020-04-19T14:05:...|Failed|
|2020-04-19T13:46:...|Failed|
|2020-04-19T14:01:...|Failed|
|2020-04-19T14:03:...|Failed|
|2020-04-19T14:01:...|Failed|
|2020-04-19T14:02:...|Failed|
|2020-04-19T14:04:...|Failed|
|2020-04-19T14:04:...|Failed|
|2020-04-19T14:05:...|Failed|
+--------------------+------+
Finally, you want to save as CSV with a custom separator (\t). Hence:
tabular_df.write.csv("path/to/file.tsv", sep="\t")
NB: You may need to manually control for types, such as converting JobTime to TimestampType, but I'll leave that up to you.
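For the type note above, one possible conversion, as a sketch (the pattern string may need adjusting for your Spark version and data):
# parse the ISO-8601 strings into a proper TimestampType column
tabular_df = tabular_df.withColumn(
    "JobTime",
    f.to_timestamp("JobTime", "yyyy-MM-dd'T'HH:mm:ss.SSSX")
)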
Hope this helps.
