Pyspark Sedona: How to unnest column values to convert string to geometry? - python

I want to convert a column from string to polygon with Sedona.
I am not quite sure whether the problem is one nested StructType too many, whether another Sedona function (other than ST_GeomFromWKB) is necessary, or whether I have to join the columns "geometry_type" and "geometry_polygon" first. Has anyone run into and solved this before?
My Dataframe looks like this:
root
|-- geo_name: string (nullable = true)
|-- geo_latitude: double (nullable = true)
|-- geo_longitude: double (nullable = true)
|-- geo_bst: integer (nullable = true)
|-- geo_bvw: integer (nullable = true)
|-- geometry_type: string (nullable = true)
|-- geometry_polygon: string (nullable = true)
The columns with geoinformation look like this:
+--------------------+
| geometry_polygon|
+--------------------+
|[[[8.4937, 49.489...|
|[[[5.0946, 51.723...|
|[[[-8.5776, 43.54...|
|[[[-8.5762, 43.55...|
|[[[6.0684, 50.766...|
+--------------------+
+-------------+
|geometry_type|
+-------------+
| Polygon|
| Polygon|
| Polygon|
| Polygon|
| Polygon|
+-------------+
I tried:
station_groups_gdf.createOrReplaceTempView("station_gdf")
spatial_station_groups_gdf = spark_sedona.sql("SELECT *, ST_GeomFromWKB(station_gdf.geometry_polygon) AS geometry_shape FROM station_gdf")
Error Message is:
ERROR FormatUtils: [Sedona] Invalid hex digit: '['
ERROR Executor: Exception in task 0.0 in stage 57.0 (TID 53)
java.lang.IllegalArgumentException: Invalid hex digit: '['
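The error points at the input format rather than at the nesting: ST_GeomFromWKB expects a WKB hex string, so it stops at the first '[' of the coordinate text. A minimal sketch, assuming geometry_polygon holds a GeoJSON-style coordinate array as valid JSON text, is to assemble a GeoJSON object from geometry_type and geometry_polygon and parse it with ST_GeomFromGeoJSON instead:
station_groups_gdf.createOrReplaceTempView("station_gdf")
# Build '{"type": "Polygon", "coordinates": [[[...]]]}' per row and let Sedona parse it
spatial_station_groups_gdf = spark_sedona.sql("""
    SELECT *,
           ST_GeomFromGeoJSON(
               CONCAT('{"type": "', geometry_type,
                      '", "coordinates": ', geometry_polygon, '}')
           ) AS geometry_shape
    FROM station_gdf
""")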

Related

Comparing two values in a structfield of a column in pyspark

I have a column where each row is a struct. I want to get the maximum of two values inside that struct.
I tried this
trends_df = trends_df.withColumn("importance_score", max(col("avg_total")["max"]["agg_importance"], col("avg_total")["min"]["agg_importance"], key=max_key))
But it throws this error
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
I am now getting it done with UDFs
max_key = lambda x: x if x else float("-inf")
_get_max_udf = udf(lambda x, y: max(x,y, key=max_key), FloatType())
trends_df = trends_df.withColumn("importance_score", _get_max_udf(col("avg_total")["max"]["agg_importance"], col("avg_total")["min"]["agg_importance"]))
This works, but I want to know if there a way I can avoid using the udf and get it done with just spark.
Edit:
This is the result of trends_df.printSchema()
root
|-- avg_total: struct (nullable = true)
| |-- max: struct (nullable = true)
| | |-- avg_percent: double (nullable = true)
| | |-- max_index: long (nullable = true)
| | |-- max_val: long (nullable = true)
| | |-- total_percent: double (nullable = true)
| | |-- total_val: long (nullable = true)
| |-- min: struct (nullable = true)
| | |-- avg_percent: double (nullable = true)
| | |-- min_index: long (nullable = true)
| | |-- min_val: long (nullable = true)
| | |-- total_percent: double (nullable = true)
| | |-- total_val: long (nullable = true)
Adding an answer from the comments to highlight it.
As answered by @smurphy, I used the greatest function:
trends_df = trends_df.withColumn("importance_score", greatest(col("avg_total")["max"]["agg_importance"], col("avg_total")["min"]["agg_importance"]))
https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.greatest
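As a quick standalone check (toy column names, not from the question), greatest skips nulls per row, which matches what the max_key lambda was doing by pushing None to -inf:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, greatest

spark = SparkSession.builder.getOrCreate()
demo_df = spark.createDataFrame([(1.0, None), (None, 2.0), (3.0, 0.5)], "a double, b double")
# greatest() returns the largest non-null value across the listed columns, per row
demo_df.withColumn("score", greatest(col("a"), col("b"))).show()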

pySpark: How can I get all element names in structType in arrayType column in a dataframe?

I have a dataframe that looks something like this:
|-- name: string (nullable = true)
|-- age: string (nullable = true)
|-- job: string (nullable = true)
|-- hobbies: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- favorite: string (nullable = true)
| | |-- non-favorite: string (nullable = true)
And I'm trying to get this information:
['favorite', 'non-favorite']
However, the closest solution I found uses the explode function with withColumn, and it assumes that I already know the element names. What I want is to get the element names from the column name alone, in this case 'hobbies', without knowing them in advance.
Is there a good way to get all the element names in any given column?
For a given dataframe with this schema:
df.printSchema()
root
|-- hobbies: array (nullable = false)
| |-- element: struct (containsNull = false)
| | |-- favorite: string (nullable = false)
| | |-- non-favorite: string (nullable = false)
You can select the field names of the struct as:
struct_fields = df.schema['hobbies'].dataType.elementType.fieldNames()
# output: ['favorite', 'non-favorite']
pyspark.sql.types.StructType.fieldNames should get you what you want.
fieldNames()
Returns all field names in a list.
>>> struct = StructType([StructField("f1", StringType(), True)])
>>> struct.fieldNames()
['f1']
So in your case, something like this (reached through the schema, since fieldNames() lives on StructType rather than on a Column):
dataframe.schema["hobbies"].dataType.elementType.fieldNames()
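A self-contained sketch of that schema walk, building a throwaway DataFrame with the same shape:
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()
schema = StructType([
    StructField("hobbies", ArrayType(StructType([
        StructField("favorite", StringType()),
        StructField("non-favorite", StringType()),
    ])))
])
df = spark.createDataFrame([([("hiking", "cleaning")],)], schema)
# Walk the schema: column -> ArrayType -> elementType (a StructType) -> fieldNames()
print(df.schema["hobbies"].dataType.elementType.fieldNames())
# ['favorite', 'non-favorite']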

How to change a column type from "Array" to "String" with Pyspark?

I have a dataset containing a column with the following schema:
root
|-- id_: string (nullable = true)
|-- payload: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
where it can be seen that the second column, payload, contains lists of dictionaries as its entries. I would like to change the type of this column from array to string, and I have tried the following code, as suggested by https://sparkbyexamples.com/pyspark/pyspark-convert-array-column-to-string-column/ :
df = df.withColumn("payload", concat_ws(",",col("payload")))
However, I am getting an unexpected error (see below). I think it is due to the fact that the lists contained in each column entry store dictionaries. Does anyone know how to get around this problem?
argument 2 requires (array<string> or string) type, however,`payload` is of array<map<string,string>> type.;
Many thanks,
Marioanzas
EDIT after @Srinivas's proposed solution: I get the following error.
Syntax Error.
File "unnamed_3", line 7
df.withColumn("payload", F.expr(concat_ws(',',flatten(transform(payload,x -> transform(map_keys(x),y -> concat(y,x[y])))))))
^
SyntaxError: invalid syntax
Convert the key/value pairs inside each map to strings, flatten the result, and pass it to concat_ws. Note that the whole expression must be passed to F.expr as a quoted SQL string, which is what the attempt above is missing.
Check below code.
df.printSchema()
root
|-- id_: string (nullable = true)
|-- payload: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: string (valueContainsNull = true)
df.show()
+----+----------------+
|id_ |payload |
+----+----------------+
|id_a|[[a -> a value]]|
|id_b|[[b -> b value]]|
|id_c|[[c -> c value]]|
+----+----------------+
df.withColumn(
    "payload",
    F.expr("concat_ws(',',flatten(transform(payload,x -> transform(map_keys(x),y -> concat(y,x[y])))))")
).show()
+----+--------+
|id_ |payload |
+----+--------+
|id_a|aa value|
|id_b|bb value|
|id_c|cc value|
+----+--------+
Spark Version - 2.4
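For completeness, here is the same idea as a self-contained snippet (imports included, toy data shaped like the question's; transform/flatten need Spark 2.4+):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("id_a", [{"a": "a value"}]), ("id_b", [{"b": "b value"}])],
    "id_ string, payload array<map<string,string>>",
)
# The whole SQL expression goes to F.expr as one quoted string
df.withColumn(
    "payload",
    F.expr("concat_ws(',', flatten(transform(payload, x -> transform(map_keys(x), y -> concat(y, x[y])))))"),
).show(truncate=False)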

Does Schema depend on first row while converting RDD to DataFrame in pyspark?

My question is: when converting from an RDD to a DataFrame in PySpark, does the schema depend on the first row?
data1 = [('A','abc',0.1,'',0.562),('B','def',0.15,0.5,0.123),('A','ghi',0.2,0.2,0.1345),('B','jkl','',0.1,0.642),('B','mno',0.1,0.1,'')]
>>> val1=sc.parallelize(data1).toDF()
>>> val1.show()
+---+---+----+---+------+
| _1| _2| _3| _4| _5|
+---+---+----+---+------+
| A|abc| 0.1| | 0.562| <------ Does it depend on the type of this row?
| B|def|0.15|0.5| 0.123|
| A|ghi| 0.2|0.2|0.1345|
| B|jkl|null|0.1| 0.642|
| B|mno| 0.1|0.1| null|
+---+---+----+---+------+
>>> val1.printSchema()
root
|-- _1: string (nullable = true)
|-- _2: string (nullable = true)
|-- _3: double (nullable = true)
|-- _4: string (nullable = true)
|-- _5: double (nullable = true)
As you can see, column _4 should have been double, but it was inferred as string.
Any suggestions would be helpful.
Thanks!
@Prathik, I think you are right.
toDF() is a shorthand for spark.createDataFrame(rdd, schema, sampleRatio).
Here's the signature for createDataFrame:
def createDataFrame(self, data, schema=None, samplingRatio=None, verifySchema=True)
So by default, the parameters schema and samplingRatio are None.
According to the doc:
If schema inference is needed, samplingRatio is used to determine the ratio of
rows used for schema inference. The first row will be used if samplingRatio is None.
So by default, toDF() will use the first row to infer the data types, which gives StringType for column 4 but DoubleType for column 5.
Here you can't simply declare columns 4 and 5 as DoubleType in a schema, since they contain strings.
But you can try setting sampleRatio to 0.3 as below:
data1 = [('A','abc',0.1,'',0.562),('B','def',0.15,0.5,0.123),('A','ghi',0.2,0.2,0.1345),('B','jkl','',0.1,0.642),('B','mno',0.1,0.1,'')]
val1=sc.parallelize(data1).toDF(sampleRatio=0.3)
val1.show()
val1.printSchema()
Sometimes the above code will throw an error if it happens to sample a row containing a string:
Can not merge type <class 'pyspark.sql.types.DoubleType'> and <class 'pyspark.sql.types.StringType'>
but if you are patient and retry a few times (fewer than 10 for me), you may get something like this. You can see that both columns 4 and 5 are DoubleType because, by luck, the sampling picked rows with double values while running createDataFrame.
+---+---+----+----+------+
| _1| _2| _3| _4| _5|
+---+---+----+----+------+
| A|abc| 0.1|null| 0.562|
| B|def|0.15| 0.5| 0.123|
| A|ghi| 0.2| 0.2|0.1345|
| B|jkl|null| 0.1| 0.642|
| B|mno| 0.1| 0.1| null|
+---+---+----+----+------+
root
|-- _1: string (nullable = true)
|-- _2: string (nullable = true)
|-- _3: double (nullable = true)
|-- _4: double (nullable = true)
|-- _5: double (nullable = true)
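If deterministic types matter more than inference, one option (a sketch, not part of the original answer; it reuses the spark/sc session and data1 from above) is to map the empty-string placeholders to None and pass an explicit schema, so no sampling is involved:
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

explicit_schema = StructType([
    StructField("_1", StringType(), True),
    StructField("_2", StringType(), True),
    StructField("_3", DoubleType(), True),
    StructField("_4", DoubleType(), True),
    StructField("_5", DoubleType(), True),
])
# Empty strings cannot live in a double column, so map them to None first
cleaned_rdd = sc.parallelize(data1).map(
    lambda row: tuple(None if v == '' else v for v in row)
)
val1 = spark.createDataFrame(cleaned_rdd, explicit_schema)
val1.printSchema()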

Convert PySpark dataframe column type to string and replace the square brackets

I need to convert a PySpark df column type from array to string and also remove the square brackets. This is the schema for the dataframe. The columns that need to be processed are CurrencyCode and TicketAmount.
>>> plan_queryDF.printSchema()
root
|-- event_type: string (nullable = true)
|-- publishedDate: string (nullable = true)
|-- plannedCustomerChoiceID: string (nullable = true)
|-- assortedCustomerChoiceID: string (nullable = true)
|-- CurrencyCode: array (nullable = true)
| |-- element: string (containsNull = true)
|-- TicketAmount: array (nullable = true)
| |-- element: string (containsNull = true)
|-- currentPlan: boolean (nullable = true)
|-- originalPlan: boolean (nullable = true)
|-- globalId: string (nullable = true)
|-- PlanJsonData: string (nullable = true)
sample data from dataframe
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
| event_type| publishedDate|plannedCustomerChoiceID|assortedCustomerChoiceID|CurrencyCode|TicketAmount|currentPlan|originalPlan| globalId| PlanJsonData|
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
|PlannedCustomerCh...|2016-08-23T04:46:...| 087d1ff1-5f3a-496...| 2539cc4a-37e5-4f3...| [GBP]| [0]| false| false|000576015000|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T04:30:...| 0a1af217-d1e8-4ab...| 61bc5fda-0160-484...| [CNY]| [329]| true| false|000189668017|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T05:49:...| 1028b477-f93e-47f...| c6d5b761-94f2-454...| [JPY]| [3400]| true| false|000576058003|{"httpStatus":200...|
How can I do it? Currently I am casting to string and then replacing the square brackets with regexp_replace, but this approach fails when I process a huge amount of data.
Is there any other way I can do it?
This is what I want.
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
| event_type| publishedDate|plannedCustomerChoiceID|assortedCustomerChoiceID|CurrencyCode|TicketAmount|currentPlan|originalPlan| globalId| PlanJsonData|
+--------------------+--------------------+-----------------------+------------------------+------------+------------+-----------+------------+------------+--------------------+
|PlannedCustomerCh...|2016-08-23T04:46:...| 087d1ff1-5f3a-496...| 2539cc4a-37e5-4f3...| GBP| 0| false| false|000576015000|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T04:30:...| 0a1af217-d1e8-4ab...| 61bc5fda-0160-484...| CNY| 329| true| false|000189668017|{"httpStatus":200...|
|PlannedCustomerCh...|2016-08-23T05:49:...| 1028b477-f93e-47f...| c6d5b761-94f2-454...| JPY| 3400| true| false|000576058003|{"httpStatus":200...|
You can try getItem(0):
df \
.withColumn("CurrencyCode", df["CurrencyCode"].getItem(0).cast("string")) \
.withColumn("TicketAmount", df["TicketAmount"].getItem(0).cast("string"))
The final cast to string is optional.
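If the arrays could ever carry more than one element, array_join (available since Spark 2.4) is an alternative worth trying; it joins every element with a delimiter and never prints brackets:
from pyspark.sql import functions as F

df = df.withColumn("CurrencyCode", F.array_join("CurrencyCode", ",")) \
       .withColumn("TicketAmount", F.array_join("TicketAmount", ","))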
