How to compare integer elements in PySpark dataframe array - python

I have a dataframe with a schema like this:
|-- gs: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- g: struct (nullable = true)
| | | |-- calls: array (nullable = true)
| | | | |-- element: integer (containsNull = true)
The last row in this schema is an array/list that should always contain 2 integers, usually both 0.
I feed this into a function, where I compare the array like this:
((df.gs.g.calls[0] == 0) & (df.gs.g.calls[1] > 0))
which I expected to just work, but it is giving weird errors:
AnalysisException: u"cannot resolve '(gs.g.calls[0] = 0)' due to data type mismatch: differing types in '(gs.g.calls[0] = 0)' (array and int)
Why isn't this working like plain Python, where
some_list[3] == 4
simply compares a list element (an int) with an int?
How can I make a comparison on these integers?
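A likely cause, with an untested sketch of a workaround: because gs is an array of structs, gs.g.calls resolves to an array of calls arrays, so gs.g.calls[0] is still an array and cannot be compared to an int. A higher-order function such as exists (Spark 2.4+) evaluates the condition per element of gs instead; this sketch assumes the check should pass when any gs element matches:
import pyspark.sql.functions as F

# Untested sketch: evaluate the condition for each element of gs
# instead of indexing the outer array.
df.where(F.expr("exists(gs, x -> x.g.calls[0] = 0 AND x.g.calls[1] > 0)"))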

Related

Pyspark Sedona: How to unnest column values to convert string to geometry?

I want to convert a column from string to polygon with Sedona.
I am not quite sure whether the problem is one nested StructType too many, whether a Sedona function other than ST_GeomFromWKB is necessary, or whether I have to join the columns "geometry_type" and "geometry_polygon" first. Has anyone run into and solved this before?
My Dataframe looks like this:
root
|-- geo_name: string (nullable = true)
|-- geo_latitude: double (nullable = true)
|-- geo_longitude: double (nullable = true)
|-- geo_bst: integer (nullable = true)
|-- geo_bvw: integer (nullable = true)
|-- geometry_type: string (nullable = true)
|-- geometry_polygon: string (nullable = true)
The columns with geo information look like this:
+--------------------+
| geometry_polygon|
+--------------------+
|[[[8.4937, 49.489...|
|[[[5.0946, 51.723...|
|[[[-8.5776, 43.54...|
|[[[-8.5762, 43.55...|
|[[[6.0684, 50.766...|
+--------------------+
+-------------+
|geometry_type|
+-------------+
| Polygon|
| Polygon|
| Polygon|
| Polygon|
| Polygon|
+-------------+
I tried:
station_groups_gdf.createOrReplaceTempView("station_gdf")
spatial_station_groups_gdf = spark_sedona.sql("SELECT *, ST_GeomFromWKB(station_gdf.geometry_polygon) AS geometry_shape FROM station_gdf")
Error Message is:
ERROR FormatUtils: [Sedona] Invalid hex digit: '['
ERROR Executor: Exception in task 0.0 in stage 57.0 (TID 53)
java.lang.IllegalArgumentException: Invalid hex digit: '['
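A hedged guess at a fix, untested: the value in geometry_polygon looks like raw GeoJSON coordinates rather than WKB hex, which is why ST_GeomFromWKB rejects the '['. Assembling a GeoJSON object from geometry_type and geometry_polygon and parsing it with ST_GeomFromGeoJSON may work instead:
# Untested sketch: build a GeoJSON geometry string and parse it with Sedona.
station_groups_gdf.createOrReplaceTempView("station_gdf")
spatial_station_groups_gdf = spark_sedona.sql("""
    SELECT *,
           ST_GeomFromGeoJSON(
               concat('{"type": "', geometry_type, '", "coordinates": ', geometry_polygon, '}')
           ) AS geometry_shape
    FROM station_gdf
""")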

How to merge two dataframes in pyspark with different columns inside struct or array?

Let's say there are two dataframes: a reference dataframe and a target dataframe.
The reference DF carries the reference schema.
Schema for reference DF (r_df)
r_df.printSchema()
root
|-- _id: string (nullable = true)
|-- notificationsSend: struct (nullable = true)
| |-- mail: boolean (nullable = true)
| |-- sms: boolean (nullable = true)
|-- recordingDetails: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- channelName: string (nullable = true)
| | |-- fileLink: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- recorderId: string (nullable = true)
| | |-- resourceId: string (nullable = true)
However, the target dataframe's schema is dynamic in nature.
Schema for target DF (t_df)
t_df.printSchema()
root
|-- _id: string (nullable = true)
|-- notificationsSend: struct (nullable = true)
| |-- sms: string (nullable = true)
|-- recordingDetails: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- channelName: string (nullable = true)
| | |-- fileLink: string (nullable = true)
| | |-- id: string (nullable = true)
| | |-- recorderId: string (nullable = true)
| | |-- resourceId: string (nullable = true)
| | |-- createdBy: string (nullable = true)
So we observe multiple changes in the target's schema:
Structs or arrays inside t_df can have more or fewer columns.
Column datatypes can change too, so type casting is required (e.g. the sms column is boolean in r_df but string in t_df).
I was able to add/remove columns of non-struct datatypes, but structs and arrays are a real pain for me. Since there are 50+ columns, I need an optimised solution that works for all of them.
Any solution, opinion, or workaround would be really helpful.
Expected output
I want to make t_df's schema exactly the same as r_df's schema.
The code below is untested but should show how to do it (written from memory without testing).
There may be a cleaner way to get the fields out of a struct, but I'm not aware of one, so I'm interested to hear others' ideas.
Extract struct column names and types.
Find columns that need to be dropped.
Drop columns.
Rebuild structs according to r_df.
from pyspark.sql.functions import col, lit, struct

structs_in_r_df = [
    field.name for field in r_df.schema.fields
    if str(field.dataType).startswith("Struct")
]  # list comprehension over the schema to collect the struct-typed columns

struct_columns = []
for structs in structs_in_r_df:  # get a list of the fields in each struct
    struct_columns.append(
        r_df.select(f"{structs}.*").columns
    )

missingColumns = list(set(r_df.columns) - set(t_df.columns))  # columns missing from t_df
similarColumns = list(set(r_df.columns).intersection(set(t_df.columns)))  # columns present in both

# Remove struct columns from both lists so you don't represent them twice.
# You need to repeat the above intersection/missing logic for the structs and then rebuild them,
# but the above gives you the idea of how to get the fields out.
# You can use variable substitution, e.g. col(f"{structs}.{field}"), to get the values out of the fields.

result = r_df.union(
    t_df.select(*(
        [lit(None).cast(dict(r_df.dtypes)[column]).alias(column) for column in missingColumns] +
        [col(column).cast(dict(r_df.dtypes)[column]).alias(column) for column in similarColumns]
    ))  # list comprehensions passed as varargs to select will dynamically pull out the values you need
)
Here's a way, once you have the union, to pull back the structs:
result = result.select(
    col("_id"),
    struct(col("sms").alias("sms")).alias("notificationsSend"),
    struct(
        *[col(column).alias(column) for column in struct_columns[0]]  # pass the field names collected above as varargs to struct (index struct_columns for the struct being rebuilt)
    ).alias("recordingDetails")  # reconstitute the struct from its fields
)
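As for reading a struct's fields more directly, one option (a sketch, not something I've run against this exact schema) is to go through the DataFrame's schema object instead of select("struct.*").columns:
# Sketch: pull a struct column's field names straight from the schema object.
notification_fields = [f.name for f in r_df.schema["notificationsSend"].dataType.fields]
# expected: ['mail', 'sms'] given the reference schema above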

Add new element to nested array of structs pyspark

I have a dataframe with the following schema using pyspark:
|-- suborders: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- trackingStatusHistory: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- trackingStatusUpdatedAt: string (nullable = true)
| | | | |-- trackingStatus: string (nullable = true)
What I want to do is create a new deliveredat element for each element of the suborders array, based on conditions.
I need to find the date within the trackingStatusHistory array where trackingStatusHistory.trackingStatus = 'delivered'. If this trackingStatus exists, the new deliveredat element receives the date from trackingStatusHistory.trackingStatusUpdatedAt; if it doesn't exist, it receives null.
How can I do this using pyspark?
You can do that using the higher-order functions transform + filter on arrays. For each struct element of the suborders array, you add a new field by filtering the sub-array trackingStatusHistory and getting the delivery date, like this:
import pyspark.sql.functions as F

df = df.withColumn(
    "suborders",
    F.expr("""
        transform(
            suborders,
            x -> struct(
                filter(x.trackingStatusHistory, y -> y.trackingStatus = 'delivered')[0].trackingStatusUpdatedAt as deliveredAt,
                x.trackingStatusHistory as trackingStatusHistory
            )
        )
    """)
)
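To sanity-check the result, something along these lines (a quick inspection, not part of the original answer) shows the new field for the first suborder; deliveredAt is null when no 'delivered' status was found:
# Hypothetical check of the new field on the first suborder.
df.select(F.expr("suborders[0].deliveredAt").alias("first_deliveredAt")).show(truncate=False)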

Filtering array as a column in dataframe

root
|-- id: string (nullable = true)
|-- elements: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- name: string (nullable = true)
| | |-- data: struct (nullable = true)
| | | |-- name: string (nullable = true)
| | | |-- surname: string (nullable = true)
| | |-- value: float (nullable = true)
| | |-- othername: string (nullable = true)
Given that dataframe structure, I'm trying to filter for elements whose value is greater than some X, e.g. 0.5. However, when I try to filter:
df.where(col('elements.value') > 0.5)
it throws
cannot resolve '(spark_catalog.default.tempD.`elements`.`value` > 0.5D)' due to data type mismatch:
differing types in '(spark_catalog.default.tempD.`elements`.`value` > 0.5D)' (array<float> and double).;;
I can't figure out how to fix that. Wrapping the value with float(), e.g. float(0.5), changes nothing.
I bet it is a simple fix, but I've been struggling with it for too many hours.
You can try higher order expressions to filter the array:
df2 = df.selectExpr('id', 'filter(elements, x -> x.value > 0.5) filtered')
A normal where filter doesn't work because it cannot be applied to an array. Imagine your array contains two structs, one with value > 0.5 and the other with value < 0.5: it's not possible to determine whether that row should be included or not.
If you want to filter the rows where ALL values in the array are > 0.5, you can use
df.where('array_min(transform(elements, x -> x.value > 0.5))')
The clause is only true if every item in the array evaluates to true.
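If instead you want the rows where ANY element has value > 0.5 (a case the above doesn't cover), the exists higher-order function (also available since Spark 2.4) should do it, though I haven't tested it here:
df.where('exists(elements, x -> x.value > 0.5)')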

How to change a column type from "Array" to "String" with Pyspark?

I have a dataset containing a column with the following schema:
root
|-- id_: string (nullable = true)
|-- payload: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
where it can be seen that the second column, payload, contains lists of dictionaries as its entries. I would like to change the type of this column from array to string, and I have tried the following code, as suggested by https://sparkbyexamples.com/pyspark/pyspark-convert-array-column-to-string-column/ :
from pyspark.sql.functions import col, concat_ws

df = df.withColumn("payload", concat_ws(",", col("payload")))
However, I am getting an unexpected error (see below). I think it is due to the fact that the lists contained in each column entry store dictionaries. Does anyone know how to get around this problem?
argument 2 requires (array<string> or string) type, however,`payload` is of array<map<string,string>> type.;
Many thanks,
Marioanzas
EDIT after @SRINIVAS's proposed solution: I get the following error.
Syntax Error.
File "unnamed_3", line 7
df.withColumn("payload", F.expr(concat_ws(',',flatten(transform(payload,x -> transform(map_keys(x),y -> concat(y,x[y])))))))
^
SyntaxError: invalid syntax
Convert the key/value data inside each map to an array of strings, then flatten the data and pass the result to the concat_ws function.
Check the code below.
df.printSchema()
root
|-- id_: string (nullable = true)
|-- payload: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
df.show()
+----+----------------+
|id_ |payload |
+----+----------------+
|id_a|[[a -> a value]]|
|id_b|[[b -> b value]]|
|id_c|[[c -> c value]]|
+----+----------------+
import pyspark.sql.functions as F

df.withColumn(
    "payload",
    F.expr("concat_ws(',',flatten(transform(payload,x -> transform(map_keys(x),y -> concat(y,x[y])))))")
).show()
+----+--------+
|id_ |payload |
+----+--------+
|id_a|aa value|
|id_b|bb value|
|id_c|cc value|
+----+--------+
Spark Version - 2.4
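As a side note on the SyntaxError in the question's edit: F.expr expects the whole SQL expression as a single quoted Python string; written without the quotes, Python tries to parse the SQL itself and fails at the ->. Quoting the expression, exactly as in the snippet above, avoids that:
df.withColumn("payload", F.expr("concat_ws(',',flatten(transform(payload,x -> transform(map_keys(x),y -> concat(y,x[y])))))"))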
