I don't know if my title is very clear. I have a table with a lot of columns (more than a hundred). Some of my columns contain values with brackets, and I need to explode them into several rows. Here is a reproducible example:
# Import libraries
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import *
import pandas as pd
# Create an example
columns = ["Name", "Age", "Activity", "Studies"]
data = [("Jame", 25, "[Painting,Yoga]", "[Math,Physics]"), ("Anne", 20, "[Garden,Cooking,Travel]", "[Communication,Marketing]"), ("Jane", 10, "[Gymnastique]", "[Basic School]")]
df = spark.createDataFrame(data=data,schema=columns)
df.show(truncate=False)
it shows the following table:
+----+---+-----------------------+-------------------------+
|Name|Age|Activity |Studies |
+----+---+-----------------------+-------------------------+
|Jame|25 |[Painting,Yoga] |[Math,Physics] |
|Anne|20 |[Garden,Cooking,Travel]|[Communication,Marketing]|
|Jane|10 |[Gymnastique] |[Basic School] |
+----+---+-----------------------+-------------------------+
I need to determine which columns contain brackets in their values:
list_col = df.dtypes
df_array_col = spark.createDataFrame(list_col)\
    .withColumnRenamed("_1", "Colname")\
    .withColumnRenamed("_2", "TypeColumn")\
    .filter(col("TypeColumn") == "string")\
    .withColumn("IsBracket", lit(0))\
    .toPandas()
# Function for determining which columns contain brackets as a value
def func_isSquaredBracket(my_col):
    A = df.select(first(col(my_col).rlike(r"\["), ignorenulls=True).alias(my_col))
    val_IsBracket = A.select(col(my_col)).collect()[0][0]
    return val_IsBracket
# For loop for applying the function
n_array = df_array_col.count()["Colname"]
for index, row in df_array_col.iterrows():
    IsBracket_value = func_isSquaredBracket(df_array_col.at[index, "Colname"])
    if IsBracket_value == True:
        df_array_col.at[index, "IsBracket"] = 1
With this I successfully identify which columns have brackets as values. Now I can explode my table:
def func_extractStringInBracket_andSplit(my_col):
    extract_string = regexp_extract(my_col, r'(?<=\[).+?(?=\])', 0).alias(my_col)
    string_split = split(extract_string, r"\||,").alias(my_col)
    string_explode_array = explode_outer(string_split).alias(my_col)
    return string_explode_array
df_explode_bracket = df
for index, row in df_array_bracket_col.iterrows():
    colname = df_array_bracket_col["Colname"][index]
    df_explode_bracket = df_explode_bracket.withColumn(colname, func_extractStringInBracket_andSplit(colname))
df_explode_bracket.show(truncate=False)
I obtain the result I want:
+----+---+-----------+-------------+
|Name|Age|Activity |Studies |
+----+---+-----------+-------------+
|Jame|25 |Painting |Math |
|Jame|25 |Painting |Physics |
|Jame|25 |Yoga |Math |
|Jame|25 |Yoga |Physics |
|Anne|20 |Garden |Communication|
|Anne|20 |Garden |Marketing |
|Anne|20 |Cooking |Communication|
|Anne|20 |Cooking |Marketing |
|Anne|20 |Travel |Communication|
|Anne|20 |Travel |Marketing |
|Jane|10 |Gymnastique|Basic School |
+----+---+-----------+-------------+
However, this solution does not scale well: with more than 100 columns it takes more than 6 minutes to get the result, and it prints the following warning:
/opt/spark/python/lib/pyspark.zip/pyspark/sql/pandas/conversion.py:289: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
'JavaPackage' object is not callable
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
warnings.warn(msg)
I am pretty new to PySpark and I am not an expert in Python. My question is: how can I optimize the solution by using PySpark instead of Pandas? A for loop is not ideal when you have the opportunity to use parallel processing.
It's actually pretty easy, use regexp_extract_all:
from pyspark.sql import functions as F

df = (
    df.withColumn("Activity_list", F.expr(r"regexp_extract_all(Activity, '(\\w+)', 1)"))
    .withColumn("Studies_list", F.expr(r"regexp_extract_all(Studies, '(\\w+)', 1)"))
)
df = (
    df.drop("Activity", "Studies")
    .withColumn("Activity", F.explode("Activity_list"))
    .withColumn("Studies", F.explode("Studies_list"))
)
Edit: It even works with strings without brackets.
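Since the question mentions more than 100 columns, here is a minimal sketch along the same lines (assuming Spark 3.1+ for regexp_extract_all and the example df from the question) that first flags the bracketed string columns in a single pass and then explodes each of them by name, instead of hard-coding Activity and Studies:
from pyspark.sql import functions as F

# Sketch only: flag, in one pass over the data, which string columns hold bracketed values
string_cols = [c for c, t in df.dtypes if t == "string"]
flags = df.select(
    [F.max(F.col(c).rlike(r"^\[").cast("int")).alias(c) for c in string_cols]
).first().asDict()
bracket_cols = [c for c, v in flags.items() if v == 1]

# Explode each flagged column in place; the '\\w+' pattern mirrors the answer above,
# so multi-word values such as "Basic School" will also be split on the space
for c in bracket_cols:
    df = df.withColumn(
        c, F.explode(F.expr(f"regexp_extract_all({c}, '(\\\\w+)', 1)"))
    )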
I'd like to know how to get the value of a calculation done using functions such as date_add, datediff, date_sub, etc. as an actual value in a variable.
As an example:
start_date = '2022-03-06'
end_date = '2022-03-01'
date_lag = datediff(to_date(lit(start_date)), to_date(lit(end_date)))
If I run date_lag, the output is: Column<'datediff(to_date(2022-03-06), to_date(2022-03-01))'>.
The expected output would be 5.
I was told by a coworker, I'd have to create a dataframe, apply the column expression and then apply a collect to get the value, but I was hoping there would be a simpler way to do it.
You have used the PySpark functions datediff, to_date, and lit. They all return a Column data type. Columns (including the results of your calculations) are not evaluated unless you add them to a dataframe AND return data from the dataframe in some way.
So your colleague was right: first you need to create a dataframe (which will hold your column), and then, since you want the value in a variable, you have to specify which record from that column to take (this can be done using collect, head, take, first, ...).
Creating a dataframe with 3 records and adding your column to it:
from pyspark.sql import functions as F
start_date = '2022-03-06'
end_date = '2022-03-01'
date_lag = F.datediff(F.to_date(F.lit(start_date)), F.to_date(F.lit(end_date)))
df = spark.range(3).select(
date_lag.alias('column_name')
)
df.show()
# +-----------+
# |column_name|
# +-----------+
# | 5|
# | 5|
# | 5|
# +-----------+
Any of the following lines will write the top row's value of your column into a variable.
date_lag_var = df.head().column_name
date_lag_var = df.first().column_name
date_lag_var = df.take(1)[0].column_name
date_lag_var = df.limit(1).collect()[0].column_name
You can easily do it using plain Python:
>>> from datetime import datetime
>>> start_date = '2022-03-06'
>>> end_date = '2022-03-01'
>>> str_d1=start_date.split("-")[0]+"/"+start_date.split("-")[1]+"/"+start_date.split("-")[2]
>>> str_d1
'2022/03/06'
>>> str_d2=end_date.split("-")[0]+"/"+end_date.split("-")[1]+"/"+end_date.split("-")[2]
>>> str_d2
'2022/03/01'
>>> d1 = datetime.strptime(str_d1, "%Y/%m/%d")
>>> d2 = datetime.strptime(str_d2, "%Y/%m/%d")
>>> delta = d1-d2
>>> delta.days
5
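As a side note, when the inputs are ISO-formatted strings like these, a shorter plain-Python version (assuming Python 3.7+ for date.fromisoformat) skips the reformatting step entirely:
>>> from datetime import date
>>> # ISO strings parse directly, no '-' to '/' conversion needed
>>> delta = date.fromisoformat(start_date) - date.fromisoformat(end_date)
>>> delta.days
5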
I have a PySpark data frame that has a mix of integer columns, string columns, and also struct columns. A struct column could be a struct, but it could also just be null. For example:
id | mystring  | mystruct
-------------------------
1  | something | <struct>
2  | something | null
3  | 0         | null
4  | something | null
5  | something | <struct>
Is there any easy way to go through the entire data frame and get the count of null/na/0 values without having to explode the struct columns? For example, for the table above I would want:
id | mystring | mystruct
------------------------
0  | 1        | 3
I've seen a few different methods but they always seem to throw an error with the struct types, and I'd rather not have to do them separately.
Not exactly an easy way, but you could define a function to handle the nulls (all input types) and nans/zeros (for numeric inputs) for each column. Then you can join the results for each column separately.
from pyspark.sql import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf
from pyspark.sql.types import *
from pyspark.sql.functions import monotonically_increasing_id
conf = SparkConf()
sc = SparkContext(conf=conf)
spark = SparkSession(sc)
# setup
data = [[1, {'f':[1,2,3]}], [2, None],[0, None], [1, None], [3, {'f':[1]}]]
schema = StructType([
    StructField('mynum', IntegerType(), True),
    StructField('mystruct',
                StructType([StructField('f', ArrayType(IntegerType()), True)]), True)
])
rdd = spark.sparkContext.parallelize(data)
df = spark.createDataFrame(rdd, schema)
def get_nulls_nans_zeros(c, df):
    # all inputs
    val = df.select(count(when(isnull(c), c)).alias(c))
    t = type(df.schema[c].dataType)
    # numeric inputs
    if t in [ByteType, ShortType, IntegerType, LongType, FloatType, DoubleType, DecimalType]:
        val = val.union(df.select(count(when(isnan(c), c)).alias(c)))
        val = val.union(df.select(count(when(col(c) == 0, c)).alias(c)))
    return val.select(sum(c).alias(c))
# Get and merge results for each column
res = [get_nulls_nans_zeros(c, df) for c in df.columns]
res = [r.withColumn("id", monotonically_increasing_id()) for r in res]
result = res[0].join(res[1], "id", "outer").drop("id")
result.show()
If you're using Spark 3.1+, you can also use the allowMissingColumns flag in unionByName to do the last part instead of having to join via a monotonically increasing id.
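For example, a rough sketch of that Spark 3.1+ variant, reusing res as first defined above (i.e. before the monotonically_increasing_id column is added):
from functools import reduce
from pyspark.sql import functions as F

# Sketch only: union the one-row, one-column results by name, letting missing
# columns become null, then collapse everything into a single row
merged = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), res)
result = merged.agg(*[F.first(c, ignorenulls=True).alias(c) for c in df.columns])
result.show()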
I am trying to fill in a series of observations on a Spark dataframe. Basically I have a list of days, and I need to create the missing ones for each group.
In pandas there is the reindex function, which is not available in pyspark.
I tried to implement a pandas UDF:
@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def reindex_by_date(df):
    df = df.set_index('dates')
    dates = pd.date_range(df.index.min(), df.index.max())
    return df.reindex(dates, fill_value=0).ffill()
This looks like it should do what I need, however it fails with this message:
AttributeError: Can only use .dt accessor with datetimelike values
What am I doing wrong here?
Here is the full code:
import pandas as pd
from pyspark.sql.functions import col, pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, IntegerType, DateType, DoubleType

data = spark.createDataFrame(
    [(1, "2020-01-01", 0),
     (1, "2020-01-03", 42),
     (2, "2020-01-01", -1),
     (2, "2020-01-03", -2)],
    ('id', 'dates', 'value'))
data = data.withColumn('dates', col('dates').cast("date"))
schema = StructType([
    StructField('id', IntegerType()),
    StructField('dates', DateType()),
    StructField('value', DoubleType())])
@pandas_udf(schema, functionType=PandasUDFType.GROUPED_MAP)
def reindex_by_date(df):
    df = df.set_index('dates')
    dates = pd.date_range(df.index.min(), df.index.max())
    return df.reindex(dates, fill_value=0).ffill()
data = data.groupby('id').apply(reindex_by_date)
Ideally I would like something like this:
+---+----------+-----+
| id| dates|value|
+---+----------+-----+
| 1|2020-01-01| 0|
| 1|2020-01-02| 0|
| 1|2020-01-03| 42|
| 2|2020-01-01| -1|
| 2|2020-01-02| 0|
| 2|2020-01-03| -2|
+---+----------+-----+
Case 1: Each ID has an individual date range.
I would try to reduce the content of the udf as much as possible. In this case I would only calculate the date range per ID in the udf. For the other parts I would use Spark native functions.
from pyspark.sql import types as T
from pyspark.sql import functions as F
import pandas as pd

# Get min and max date per ID
date_ranges = data.groupby('id').agg(F.min('dates').alias('date_min'), F.max('dates').alias('date_max'))

# Calculate the date range for each ID
@F.udf(returnType=T.ArrayType(T.DateType()))
def get_date_range(date_min, date_max):
    return [t.date() for t in list(pd.date_range(date_min, date_max))]
# To get one row per potential date, we need to explode the UDF output
date_ranges = date_ranges.withColumn(
'dates',
F.explode(get_date_range(F.col('date_min'), F.col('date_max')))
)
date_ranges = date_ranges.drop('date_min', 'date_max')
# Add the value for existing entries and add 0 for others
result = date_ranges.join(
data,
['id', 'dates'],
'left'
)
result = result.fillna({'value': 0})
Case 2: All ids have the same date range
I think there is no need to use a UDF here. What you want can be achieved in a different way: First, you get all possible IDs and all necessary dates. Second, you crossJoin them, which gives you all possible combinations. Third, you left join the original data onto the combinations. Fourth, you replace the resulting null values with 0.
# Get all unique ids
ids_df = data.select('id').distinct()
# Get the date series
date_min, date_max = data.agg(F.min('dates'), F.max('dates')).collect()[0]
dates = [[t.date()] for t in list(pd.date_range(date_min, date_max))]
dates_df = spark.createDataFrame(data=dates, schema="dates:date")
# Calculate all combinations
all_combinations = ids_df.crossJoin(dates_df)
# Add the value column
result = all_combinations.join(
data,
['id', 'dates'],
'left'
)
# Replace all null values with 0
result = result.fillna({'value': 0})
Please be aware of the following limitations of this solution:
crossJoins can be quite costly. One potential solution to cope with the issue can be found in this related question.
The collect statement and the use of pandas result in a Spark transformation that is not perfectly parallelised (see the sketch below for a possible alternative).
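A possible alternative, sketched under the assumption of Spark 2.4+ (for sequence) and reusing the data dataframe from the question, builds the date series natively instead of collecting to the driver:
from pyspark.sql import functions as F

# Sketch only: build the global date range without collect() or pandas
dates_df = (
    data.agg(F.min('dates').alias('d0'), F.max('dates').alias('d1'))
        .select(F.explode(F.sequence(F.col('d0'), F.col('d1'))).alias('dates'))
)
result = (
    data.select('id').distinct()
        .crossJoin(dates_df)
        .join(data, ['id', 'dates'], 'left')
        .fillna({'value': 0})
)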
[EDIT] Split into two cases as I first thought all IDs have the same date range.
I'm using pySpark 2.1 on Databricks.
I've written a UDF to generate a unique uuid for each row of a pyspark dataframe. The dataframes I'm working with are relatively small (fewer than 10,000 rows) and should never grow beyond that.
I know that there are built-in Spark functions zipWithIndex() and zipWithUniqueId() to generate row indices, but I've been asked specifically to use uuids for this particular project.
The UDF udf_insert_uuid works fine on small data sets, but seems to be clashing with the built-in Spark function subtract.
What's causing this error:
package.TreeNodeException: Binding attribute, tree: pythonUDF0#104830
Deeper in the driver stack trace it also says:
Caused by: java.lang.RuntimeException: Couldn't find pythonUDF0#104830
This is the code I'm running below:
Create a function to generate a set of unique ids:
import pandas
from pyspark.sql.functions import *
from pyspark.sql.types import *
import uuid
# define a python function
def insert_uuid():
    user_created_uuid = str(uuid.uuid1())
    return user_created_uuid
#register the python function for use in dataframes
udf_insert_uuid = udf(insert_uuid, StringType())
create a dataframe with 50 elements
import pandas
from pyspark.sql.functions import *
from pyspark.sql.types import *
list_of_numbers = range(1000,1050)
temp_pandasDF = pandas.DataFrame(list_of_numbers, index=None)
sparkDF = (
    spark
    .createDataFrame(temp_pandasDF, ["data_points"])
    .withColumn("labels", when(col("data_points") < 1025, "a").otherwise("b"))  # if "data_points" < 1025 then "labels" = "a", else "labels" = "b"
    .repartition("labels")
)
sparkDF.createOrReplaceTempView("temp_spark_table")
#add a unique id for each row
#udf works fine in the line of code here
sparkDF = sparkDF.withColumn("id", lit( udf_insert_uuid() ))
sparkDF.show(20, False)
sparkDF output:
+-----------+------+------------------------------------+
|data_points|labels|id |
+-----------+------+------------------------------------+
|1029 |b |d3bb91e0-9cc8-11e7-9b70-00163e9986ba|
|1030 |b |d3bb95e6-9cc8-11e7-9b70-00163e9986ba|
|1035 |b |d3bb982a-9cc8-11e7-9b70-00163e9986ba|
|1036 |b |d3bb9a50-9cc8-11e7-9b70-00163e9986ba|
|1042 |b |d3bb9c6c-9cc8-11e7-9b70-00163e9986ba|
+-----------+------+------------------------------------+
only showing top 5 rows
create another DF with values different from sparkDF
list_of_numbers = range(1025,1075)
temp_pandasDF = pandas.DataFrame(list_of_numbers, index=None)
new_DF = (
    spark
    .createDataFrame(temp_pandasDF, ["data_points"])
    .withColumn("labels", when(col("data_points") < 1025, "a").otherwise("b"))  # if "data_points" < 1025 then "labels" = "a", else "labels" = "b"
    .repartition("labels"))
new_DF.show(5, False)
new_DF output:
+-----------+------+
|data_points|labels|
+-----------+------+
|1029 |b |
|1030 |b |
|1035 |b |
|1036 |b |
|1042 |b |
+-----------+------+
only showing top 5 rows
Compare the values in new_DF with sparkDF:
values_not_in_new_DF = (new_DF.subtract(sparkDF.drop("id")))
Add a uuid to each row with the udf and display it:
display(values_not_in_new_DF
.withColumn("id", lit( udf_insert_uuid())) #add a column of unique uuid's
)
The following error results:
package.TreeNodeException: Binding attribute, tree: pythonUDF0#104830
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF0#104830 at
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:88) at
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:87) at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268) at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:268) at
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:267) at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273) at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:273) at
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:307) at
org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:188) at
org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:305) at
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:273) at
org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:257) at
org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:87) at
org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$33.apply(HashAggregateExec.scala:473) at
org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$33.apply(HashAggregateExec.scala:472) at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at
scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at
scala.collection.AbstractTraversable.map(Traversable.scala:105) at
org.apache.spark.sql.execution.aggregate.HashAggregateExec.generateResultCode(HashAggregateExec.scala:472) at
org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduceWithKeys(HashAggregateExec.scala:610) at
org.apache.spark.sql.execution.aggregate.HashAggregateExec.doProduce(HashAggregateExec.scala:148) at
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:83) at
org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:78) at
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) at
org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:78) at
org.apache.spark.sql.execution.aggregate.HashAggregateExec.produce(HashAggregateExec.scala:38) at
org.apache.spark.sql.execution.WholeStageCodegenExec.doCodeGen(WholeStageCodegenExec.scala:313) at
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:354) at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) at
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) at
org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) at
org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:225) at
org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:308) at
org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38) at
org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2807) at
org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2132) at
org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2132) at
org.apache.spark.sql.Dataset$$anonfun$60.apply(Dataset.scala:2791) at
org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:87) at
org.apache.spark.sql.execution.SQLExecution$.withFileAccessAudit(SQLExecution.scala:53) at
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:70) at
org.apache.spark.sql.Dataset.withAction(Dataset.scala:2790) at
org.apache.spark.sql.Dataset.head(Dataset.scala:2132) at
org.apache.spark.sql.Dataset.take(Dataset.scala:2345) at
com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation0(OutputAggregator.scala:81) at
com.databricks.backend.daemon.driver.OutputAggregator$.withOutputAggregation(OutputAggregator.scala:42) at
com.databricks.backend.daemon.driver.PythonDriverLocal$$anonfun$getResultBuffer$1.apply(PythonDriverLocal.scala:461) at
com.databricks.backend.daemon.driver.PythonDriverLocal$$anonfun$getResultBuffer$1.apply(PythonDriverLocal.scala:441) at
com.databricks.backend.daemon.driver.PythonDriverLocal.withInterpLock(PythonDriverLocal.scala:394) at
com.databricks.backend.daemon.driver.PythonDriverLocal.getResultBuffer(PythonDriverLocal.scala:441) at
com.databricks.backend.daemon.driver.PythonDriverLocal.com$databricks$backend$daemon$driver$PythonDriverLocal$$outputSuccess(PythonDriverLocal.scala:428) at
com.databricks.backend.daemon.driver.PythonDriverLocal$$anonfun$repl$3.apply(PythonDriverLocal.scala:178) at
com.databricks.backend.daemon.driver.PythonDriverLocal$$anonfun$repl$3.apply(PythonDriverLocal.scala:175) at
com.databricks.backend.daemon.driver.PythonDriverLocal.withInterpLock(PythonDriverLocal.scala:394) at
com.databricks.backend.daemon.driver.PythonDriverLocal.repl(PythonDriverLocal.scala:175) at
com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$2.apply(DriverLocal.scala:230) at
com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$2.apply(DriverLocal.scala:211) at
com.databricks.logging.UsageLogging$$anonfun$withAttributionContext$1.apply(UsageLogging.scala:173) at
scala.util.DynamicVariable.withValue(DynamicVariable.scala:57) at
com.databricks.logging.UsageLogging$class.withAttributionContext(UsageLogging.scala:168) at
com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:39) at
com.databricks.logging.UsageLogging$class.withAttributionTags(UsageLogging.scala:206) at
com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:39) at
com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:211) at
com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:589) at
com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:589) at
scala.util.Try$.apply(Try.scala:161) at
com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:584) at
com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:488) at
com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:391) at
com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:348) at
com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:215) at
java.lang.Thread.run(Thread.java:745) Caused by: java.lang.RuntimeException: Couldn't find pythonUDF0#104830 in [data_points#104799L,labels#104802] at
scala.sys.package$.error(package.scala:27) at
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:94) at
org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:88) at
org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52) ... 82 more
I get the same error as you when I run your script. The only way I found to make it work is to pass the UDF a column instead of no argument:
def insert_uuid(col):
    user_created_uuid = str(uuid.uuid1())
    return user_created_uuid
udf_insert_uuid = udf(insert_uuid, StringType())
and then call it on labels for instance:
values_not_in_new_DF\
    .withColumn("id", udf_insert_uuid("labels"))\
    .show()
No need to use lit.
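As an aside, not part of the fix above: newer Spark versions also ship a built-in uuid() SQL expression, so if upgrading is an option the Python UDF can be avoided entirely. A sketch, assuming a Spark version that provides uuid():
from pyspark.sql import functions as F

# Sketch only: the non-deterministic built-in uuid() yields one value per row
values_not_in_new_DF.withColumn("id", F.expr("uuid()")).show(truncate=False)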
I want to create a dataframe in pyspark like the table below :
category| category_id| bucket| prop_count| event_count | accum_prop_count | accum_event_count
-----------------------------------------------------------------------------------------------------
nation | nation | 1 | 222 | 444 | 555 | 6677
So, I tried the code below:
schema = StructType([])
df = sqlContext.createDataFrame(sc.emptyRDD(), schema)
df = df.withColumn("category",F.lit('nation')).withColumn("category_id",F.lit('nation')).withColumn("bucket",bucket)
df = df.withColumn("prop_count",prop_count).withColumn("event_count",event_count).withColumn("accum_prop_count",accum_prop_count).withColumn("accum_event_count",accum_event_count)
df.show()
This is giving an error:
AssertionError: col should be Column
Also, the values of the columns have to be updated again later, and each update will also be a single row. How can I do this?
I think the problem with your code lies in the lines where you use plain variables, like .withColumn("bucket", bucket). You are trying to create a new column by passing an integer value, but withColumn expects a Column, not a single integer value.
To solve this, you can use lit just like you are already doing for "nation", like this:
df = df\
    .withColumn("category", F.lit('nation'))\
    .withColumn("category_id", F.lit('nation'))\
    .withColumn("bucket", F.lit(bucket))\
    .withColumn("prop_count", F.lit(prop_count))\
    .withColumn("event_count", F.lit(event_count))\
    .withColumn("accum_prop_count", F.lit(accum_prop_count))\
    .withColumn("accum_event_count", F.lit(accum_event_count))
Another simple and cleaner way to write it may be like this:
# create schema
fields = [StructField("category", StringType(),True),
StructField("category_id", StringType(),True),
StructField("bucket", IntegerType(),True),
StructField("prop_count", IntegerType(),True),
StructField("event_count", IntegerType(),True),
StructField("accum_prop_count", IntegerType(),True)
]
schema = StructType(fields)
# load data
data = [["nation","nation",1,222,444,555]]
df = spark.createDataFrame(data, schema)
df.show()
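The question also asks how to update the single row later. Neither snippet above covers that, but since Spark dataframes are immutable, one simple pattern (a sketch with placeholder replacement values) is to rebuild the one-row dataframe against the same schema:
# Sketch only: "updating" the row means creating a new one-row dataframe with
# the same schema; the values below are just placeholders
new_data = [["nation", "nation", 2, 333, 555, 666, 7788]]
df = spark.createDataFrame(new_data, schema)
df.show()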