I have the following schema:
root
|-- sents: array (nullable = false)
| |-- element: integer (containsNull = true)
|-- metadata: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
In a table it looks like this:
+----------+---------------------------------------------------------------------+
|sents |metadata |
+----------+---------------------------------------------------------------------+
|[1, -1, 0]|[[confidence -> 0.4991], [confidence -> 0.5378], [confidence -> 0.0]]|
+----------+---------------------------------------------------------------------+
How can I access the value from this list of maps within the array column?
Thank you.
Here are two options using the explode and transform higher-order functions in Spark.
Option 1 (explode + pyspark accessors)
First we explode the elements of the array into a new column, then we access each map with the key confidence to retrieve the value:
from pyspark.sql.functions import col, explode, expr

df = spark.createDataFrame([
    [[{"confidence": 0.4991}, {"confidence": 0.5378}, {"confidence": 0.0}]]
], ["metadata"])

df.select(explode(col("metadata")).alias("metadata")) \
  .select(col("metadata")["confidence"].alias("value")) \
  .show(truncate=False)
# +------+
# |value |
# +------+
# |0.4991|
# |0.5378|
# |0.0 |
# +------+
Option 2 (transform + explode)
Here we use transform to extract the values of the map into a new array and then we explode it:
df.select(explode(expr("transform(metadata, i -> i['confidence'])")).alias("value"))
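If you would rather keep one array of values per row instead of one row per value, the transform step on its own already does that; a minimal sketch:
# extract the map values into a plain array, one array per input row
df.select(expr("transform(metadata, i -> i['confidence'])").alias("values")).show(truncate=False)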
Related
I have a dataframe with the following schema using pyspark:
|-- suborders: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- trackingStatusHistory: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- trackingStatusUpdatedAt: string (nullable = true)
| | | | |-- trackingStatus: string (nullable = true)
What I want to do is create a new deliveredat element for each entry of the suborders array, based on a condition.
I need to find the date within the trackingStatusHistory array where trackingStatusHistory.trackingStatus = 'delivered'. If that trackingStatus exists, the new deliveredat element should receive the date in trackingStatusHistory.trackingStatusUpdatedAt; if it doesn't exist, it should receive null.
How can I do this using pyspark?
You can do that using the higher-order functions transform + filter on arrays. For each struct element of the suborders array you add a new field by filtering the sub-array trackingStatusHistory and getting the delivery date, like this:
import pyspark.sql.functions as F

df = df.withColumn(
    "suborders",
    F.expr("""transform(
        suborders,
        x -> struct(
            filter(x.trackingStatusHistory, y -> y.trackingStatus = 'delivered')[0].trackingStatusUpdatedAt as deliveredAt,
            x.trackingStatusHistory as trackingStatusHistory
        )
    )""")
)
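To sanity-check the null case on a toy dataframe (the sample rows below are made up for illustration; indexing an empty filtered array with [0] returns null under Spark's default, non-ANSI settings, which is exactly the fallback the question asks for):
from pyspark.sql import Row
import pyspark.sql.functions as F

toy = spark.createDataFrame([Row(suborders=[
    Row(trackingStatusHistory=[
        Row(trackingStatusUpdatedAt="2021-03-01T10:00:00", trackingStatus="delivered"),
        Row(trackingStatusUpdatedAt="2021-02-27T08:00:00", trackingStatus="shipped"),
    ]),
    Row(trackingStatusHistory=[
        Row(trackingStatusUpdatedAt="2021-02-28T09:00:00", trackingStatus="shipped"),
    ]),
])])

toy.withColumn(
    "suborders",
    F.expr("""transform(
        suborders,
        x -> struct(
            filter(x.trackingStatusHistory, y -> y.trackingStatus = 'delivered')[0].trackingStatusUpdatedAt as deliveredAt,
            x.trackingStatusHistory as trackingStatusHistory
        )
    )""")
).select(F.col("suborders.deliveredAt")).show(truncate=False)
# the second sub-order has no 'delivered' status, so its deliveredAt is null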
I have a spark data frame as below:
The schema of the dataframe is also given below:
|-- array_list: array (nullable = true)
| |-- element: string (containsNull = true)
|-- len_of_array: integer (nullable = false)
dataframe
+------------+------------+
|array_list  |len_of_array|
+------------+------------+
|[t1, t2, t3]|3           |
|[t1, t2]    |2           |
|[t2]        |1           |
+------------+------------+
How can we create a new column "mappings" in which a constant key "mapname" is mapped to each of the elements of the array in column "array_list"?
The expected output is:
+------------+------------+---------------------------------------------------+
|array_list  |len_of_array|mappings                                           |
+------------+------------+---------------------------------------------------+
|[t1, t2, t3]|3           |[[mapname -> t1], [mapname -> t2], [mapname -> t3]]|
|[t1, t2]    |2           |[[mapname -> t1], [mapname -> t2]]                 |
|[t2]        |1           |[[mapname -> t2]]                                  |
+------------+------------+---------------------------------------------------+
mapname (i.e. the key) is a string, i.e. "mapname", and should be the same for all array elements.
I have created an additional column "increasing_id" which has values 1, 2, 3, and then tried to define a UDF and use it in a for loop to update each row of the mappings column, but it gives an error.
from pyspark.sql.functions import array, udf
from pyspark.sql import types as T

@udf(T.MapType(T.StringType(), T.StringType()))
def create_dict(name_value):
    return {"name": name_value}
And then finally use a for loop to populate the column values:
for j in range(3):
    result = df2.where(df2.increasing_id == j).select("len_of_array")
    result = result.collect()[0][0]
    df = df.where(df.increasing_id == j)\
        .withColumn('mappings',
                    array([create_dict(df.array_list.getItem(i)) for i in range(0, result)]))
Error:
An error was encountered:
list index out of range
Traceback (most recent call last):
IndexError: list index out of range
The expected schema would look like this:
|-- mappings: array (nullable = false)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
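For reference, a minimal sketch of one way to build such a column without a UDF, using the map and transform functions in a SQL expression (requires Spark 2.4+; the column and key names follow the question):
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [(["t1", "t2", "t3"], 3), (["t1", "t2"], 2), (["t2"], 1)],
    ["array_list", "len_of_array"],
)

# turn each array element x into a single-entry map {"mapname": x}
df = df.withColumn("mappings", F.expr("transform(array_list, x -> map('mapname', x))"))
df.show(truncate=False)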
I have a dataset containing a column with the following schema:
root
|-- id_: string (nullable = true)
|-- payload: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
where it can be seen that the second column, payload, contains lists of dictionaries as its entries. I would like to change the type of this column from array to string, and I have tried the following code, as suggested by https://sparkbyexamples.com/pyspark/pyspark-convert-array-column-to-string-column/ :
df = df.withColumn("payload", concat_ws(",",col("payload")))
However, I am getting an unexpected error (see below). I think it is due to the fact that the lists contained in each column entry store dictionaries. Does anyone know how to get around this problem?
argument 2 requires (array<string> or string) type, however,`payload` is of array<map<string,string>> type.;
Many thanks,
Marioanzas
EDIT after @Srinivas's proposed solution: I get the following error.
Syntax Error.
File "unnamed_3", line 7
df.withColumn("payload", F.expr(concat_ws(',',flatten(transform(payload,x -> transform(map_keys(x),y -> concat(y,x[y])))))))
^
SyntaxError: invalid syntax
Convert the key and value data inside each map to an array of strings, then flatten the data and pass the result to the concat_ws function.
Check the code below.
df.printSchema()
root
|-- id_: string (nullable = true)
|-- payload: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
df.show()
+----+----------------+
|id_ |payload |
+----+----------------+
|id_a|[[a -> a value]]|
|id_b|[[b -> b value]]|
|id_c|[[c -> c value]]|
+----+----------------+
df.withColumn(
    "payload",
    F.expr("concat_ws(',',flatten(transform(payload,x -> transform(map_keys(x),y -> concat(y,x[y])))))")
).show()
+----+--------+
|id_ |payload |
+----+--------+
|id_a|aa value|
|id_b|bb value|
|id_c|cc value|
+----+--------+
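If you want a visible delimiter between each key and its value, you can add one inside the inner concat; a small variation of the expression above (the ':' is just an example):
df.withColumn(
    "payload",
    F.expr("concat_ws(',', flatten(transform(payload, x -> transform(map_keys(x), y -> concat(y, ':', x[y])))))")
).show()
# id_a's payload then becomes "a:a value", and so on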
Spark Version - 2.4
I have the following schema for a pyspark dataframe
root
|-- maindata: array (nullable = true)
| |-- element: array (containsNull = true)
| | |-- element: struct (containsNull = true)
| | | |-- label: string (nullable = true)
| | | |-- value: string (nullable = true)
| | | |-- unit: string (nullable = true)
| | | |-- dateTime: string (nullable = true)
Here is some data for a particular row, which I retrieved with df.select(F.col("maindata")).show(1, False):
|[[[a1, 43.24, km/h, 2019-04-06T13:02:08.020], [TripCount, 135, , 2019-04-06T13:02:08.790],["t2", 0, , 2019-04-06T13:02:08.040], [t4, 0, , 2019-04-06T13:02:08.050], [t09, 0, , 2019-04-06T13:02:08.050], [t3, 1, , 2019-04-06T13:02:08.050], [t7, 0, , 2019-04-06T13:02:08.050],[TripCount, ,136, 2019-04-06T13:02:08.790]]
I want to access the TripCount value inside this, e.g. [TripCount -> 136, 135, etc.]. What is the best way to access this data? TripCount is present multiple times.
Also, is there any way to access, for example, only the label data, like maindata.label?
I would suggest doing explode multiple times to convert the array elements into individual rows, and then either converting the struct into individual columns or working with the nested elements using the dot syntax. For example:
from pyspark.sql.functions import col, explode
df = spark.createDataFrame([[[[('k1', 'v1', 'v2')]]]], ['d'])
df2 = df.select(explode(col('d')).alias('d')).select(explode(col('d')).alias('data'))
>>> df2.printSchema()
root
|-- data: struct (nullable = true)
| |-- _1: string (nullable = true)
| |-- _2: string (nullable = true)
| |-- _3: string (nullable = true)
>>> df2.filter(col("data._1") == "k1").show()
+------------+
| data|
+------------+
|[k1, v1, v2]|
+------------+
or you can extract members of the struct as individual columns:
from pyspark.sql.functions import col, explode
df = spark.createDataFrame([[[[('k1','v1', 'v2')]]]], ['d'])
df2 = df.select(explode(col('d')).alias('d')).select(explode(col('d')).alias('d')).select("d.*").drop("d")
>>> df2.printSchema()
root
|-- _1: string (nullable = true)
|-- _2: string (nullable = true)
|-- _3: string (nullable = true)
>>> df2.filter(col("_1") == "k1").show()
+---+---+---+
| _1| _2| _3|
+---+---+---+
| k1| v1| v2|
+---+---+---+
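Applied to the maindata schema from the question, the same pattern would look roughly like this (a sketch, assuming the dataframe is called df):
from pyspark.sql.functions import col, explode

# flatten the array<array<struct>> into one struct per row
exploded = (
    df.select(explode(col("maindata")).alias("inner"))
      .select(explode(col("inner")).alias("data"))
)

# rows whose label is TripCount, keeping only the value
exploded.filter(col("data.label") == "TripCount").select("data.value").show()

# or just the label field of every element
exploded.select("data.label").show()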
I want to convert a nested JSON to TSV in a Databricks notebook using pyspark.
Below is the JSON structure; the columns can change.
{"tables":[{"name":"Result","columns":[{"name":"JobTime","type":"datetime"},{"name":"Status","type":"string"}]
,"rows":[
["2020-04-19T13:45:12.528Z","Failed"]
,["2020-04-19T14:05:40.098Z","Failed"]
,["2020-04-19T13:46:31.655Z","Failed"]
,["2020-04-19T14:01:16.275Z","Failed"],
["2020-04-19T14:03:16.073Z","Failed"],
["2020-04-19T14:01:16.672Z","Failed"],
["2020-04-19T14:02:13.958Z","Failed"],
["2020-04-19T14:04:41.099Z","Failed"],
["2020-04-19T14:04:41.16Z","Failed"],
["2020-04-19T14:05:14.462Z","Failed"]
]}
]}
I am new to Databricks. Please help.
You have two ways to deal with this problem: either you do some preprocessing in Python with the json library (or equivalent; see the sketch at the end of this answer), or you load it directly into pyspark and play around with it, such as:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
spark = SparkSession.builder.getOrCreate()
# your json
so_json = """
{"tables":[{"name":"Result","columns":[{"name":"JobTime","type":"datetime"},{"name":"Status","type":"string"}]
,"rows":[
["2020-04-19T13:45:12.528Z","Failed"]
,["2020-04-19T14:05:40.098Z","Failed"]
,["2020-04-19T13:46:31.655Z","Failed"]
,["2020-04-19T14:01:16.275Z","Failed"],
["2020-04-19T14:03:16.073Z","Failed"],
["2020-04-19T14:01:16.672Z","Failed"],
["2020-04-19T14:02:13.958Z","Failed"],
["2020-04-19T14:04:41.099Z","Failed"],
["2020-04-19T14:04:41.16Z","Failed"],
["2020-04-19T14:05:14.462Z","Failed"]
]}
]}
"""
# load in directly using read.json(), you'll see that this becomes
# a nested ArrayType/StructType wombo combo
json_df = spark.read.json(spark.sparkContext.parallelize([so_json]))
json_df.printSchema()
root
|-- tables: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- columns: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- name: string (nullable = true)
| | | | |-- type: string (nullable = true)
| | |-- name: string (nullable = true)
| | |-- rows: array (nullable = true)
| | | |-- element: array (containsNull = true)
| | | | |-- element: string (containsNull = true)
# select nested columns "tables" and "rows" and explode
array_df = json_df.select(f.explode(f.col('tables')['rows'][0]))
Exploding takes the rows array, which is an ArrayType, and splits it into actual rows.
Then you can subselect either by dot or slice notation:
array_df.printSchema()
root
|-- col: array (nullable = true)
| |-- element: string (containsNull = true)
tabular_df = array_df.select(
array_df.col[0].alias("JobTime"),
array_df.col[1].alias("Status")
)
tabular_df.show()
+--------------------+------+
| JobTime|Status|
+--------------------+------+
|2020-04-19T13:45:...|Failed|
|2020-04-19T14:05:...|Failed|
|2020-04-19T13:46:...|Failed|
|2020-04-19T14:01:...|Failed|
|2020-04-19T14:03:...|Failed|
|2020-04-19T14:01:...|Failed|
|2020-04-19T14:02:...|Failed|
|2020-04-19T14:04:...|Failed|
|2020-04-19T14:04:...|Failed|
|2020-04-19T14:05:...|Failed|
+--------------------+------+
Finally, you want to save as CSV with a custom separator (\t). Hence:
tabular_df.write.csv("path/to/file.tsv", sep="\t")
NB: You may need to manually control for types, such as converting JobTime to TimestampType, but I'll leave that up to you.
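A minimal sketch of that conversion (typed_df is just a name chosen here; Spark's timestamp cast handles the ISO-8601 strings in the sample, though you can also use to_timestamp with an explicit format):
# cast JobTime from its ISO-8601 string form to a real timestamp
typed_df = tabular_df.withColumn("JobTime", f.col("JobTime").cast("timestamp"))
typed_df.printSchema()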
Hope this helps.
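For completeness, the json-library preprocessing route mentioned at the top could look roughly like this (a sketch that assumes the tables/columns/rows layout shown above and reuses the so_json string; the output path is just an example Databricks DBFS location):
import csv
import json

# parse the raw JSON, pull out the column names and rows, and write a TSV
parsed = json.loads(so_json)
table = parsed["tables"][0]
header = [c["name"] for c in table["columns"]]

with open("/dbfs/tmp/result.tsv", "w", newline="") as fh:
    writer = csv.writer(fh, delimiter="\t")
    writer.writerow(header)
    writer.writerows(table["rows"])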