Spark Dataframe - using User Defined Function to add a column - python

I'm still learning Python. In the following example (taken from Method 3 of this article), the User Defined Function (UDF) is named Total(...,...), but the author calls it with the name new_f(...,...).
Question: In the code below, how do we know that calling new_f(...,...) should invoke the function Total(...,...)? What if there were another UDF, say Sum(...,...)? In that case, how would the code know whether calling new_f(...,...) means calling Total(...,...) or Sum(...,...)?
# import the functions as F from pyspark.sql
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

# define the function that computes the new column
def Total(Course_Fees, Discount):
    res = Course_Fees - Discount
    return res

# wrap it as a UDF with an integer return type
new_f = F.udf(Total, IntegerType())

# call the UDF to create the new column "Total_price"
new_df = df.withColumn(
    "Total_price", new_f("Course_Fees", "Discount"))

# show the DataFrame
new_df.show()

new_f = F.udf(Total, IntegerType())
wraps the Python function Total in a Spark UDF and assigns that UDF to the name new_f, so calling new_f(...) runs Total(...). If there were another function Sum, you would wrap it in its own F.udf(...) call and bind it to a different name; the name you call is simply whatever you assigned the wrapped function to.
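A minimal sketch of that second case (the Sum function and the name other_f are hypothetical, added only for illustration):

import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

def Total(Course_Fees, Discount):
    return Course_Fees - Discount

# hypothetical second function, just to show the pattern
def Sum(Course_Fees, Discount):
    return Course_Fees + Discount

# each F.udf call wraps one specific Python function, so the
# resulting names are independent of each other
new_f = F.udf(Total, IntegerType())    # new_f(...) runs Total
other_f = F.udf(Sum, IntegerType())    # other_f(...) runs Sum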

Related

How to create a new column with a null value using Pyspark DataFrame?

I'm having issues using PySpark DataFrames. I have a column called eventkey which is a concatenation of the following elements: account_type, counter_type and billable_item_sid. I have a function called apply_event_key_transform in which I want to break up the concatenated eventkey and create a new column for each of those elements.
def apply_event_key_transform(data_frame: DataFrame):
    output_df = data_frame.withColumn("account_type", getAccountTypeUDF(data_frame.eventkey)) \
        .withColumn("counter_type", getCounterTypeUDF(data_frame.eventkey)) \
        .withColumn("billable_item_sid", getBiSidUDF(data_frame.eventkey))
    output_df.drop("eventkey")
    return output_df
I've created UDF functions to retrieve the account_type, counter_type and billable_item_sid from a given eventkey value. I have a class called EventKey that takes the full eventkey string as a constructor param, and creates an object with data members to access the account_type, counter_type and billable_item_sid.
getAccountTypeUDF = udf(lambda x: get_account_type(x))
getCounterTypeUDF = udf(lambda x: get_counter_type(x))
getBiSidUDF = udf(lambda x: get_billable_item_sid(x))

def get_account_type(event_key: str):
    event_key_obj = EventKey(event_key)
    return event_key_obj.account_type.name

def get_counter_type(event_key: str):
    event_key_obj = EventKey(event_key)
    return event_key_obj.counter_type

def get_billable_item_sid(event_key: str):
    event_key_obj = EventKey(event_key)
    return event_key_obj.billable_item_sid
The issue that I'm running into is that a billable_item_sid can be null, but when I attempt to call withColumn with a None, the entire frame drops the column when I attempt to aggregate the data later. Is there a way to create a new column with a Null value using withColumn and a UDF?
Things I've tried (for testing purposes):
.withColumn("billable_item_sid", lit(getBiSidUDF(data_frame.eventkey)))
.withColumn("billable_item_sid", lit(None).castString())
Tried a when/otherwise condition for billable_item_sid for null checking
Found out the issue was caused when writing the DataFrame to JSON. Fixed this by upgrading pyspark to 3.1.1, which has an option called ignoreNullFields that can be set to False.
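For reference, a minimal sketch of what that fix looks like at write time (the output path is made up; assumes Spark 3.0+ where the JSON writer supports ignoreNullFields):

# keep null columns as explicit nulls instead of dropping the fields from the JSON output
output_df.write.option("ignoreNullFields", False).json("/tmp/events_out")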

PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable

I am getting this error after running the function below.
PicklingError: Could not serialize object: Exception: It appears that
you are attempting to reference SparkContext from a broadcast
variable, action, or transformation. SparkContext can only be used on
the driver, not in code that it run on workers. For more information,
see SPARK-5063.
The objective of this piece of code is to create a flag for every row based on the date differences. Multiple rows per user are supplied to the function to create the values of the flag.
Does anyone know what this means and how I can fix it?
def selection(df):
    _i = 0
    l = []
    x = df.select("Date").orderBy("Date").rdd.flatMap(lambda x: x).collect()
    rdd = spark.sparkContext.parallelize(x)
    l.append((x[_i], 1))
    for _j in range(_i + 1, rdd.count()):
        if ((d2.year - d1.year) * 12 + (d2.month - d1.month) >= 2):
            l.append((x[_j], 1))
        else:
            l.append((x[_j], 0))
            continue
        _i = _j
    columns = ['Date', 'flag']
    new_df = spark.createDataFrame(l, columns)
    df_new = df.join(new_df, ['Date'], "inner")
    return df_new

ToKeep = udf(lambda z: selection(z))
sample_new = sample.withColumn("toKeep", ToKeep(sample).over(Window.partitionBy(col("id")).orderBy(col("id"), col("Date"))))
@udf(StringType())
def function(x):
    # the body of a UDF cannot run Spark operations (such as spark.sql);
    # code that needs them would have to create a new SparkContext,
    # which is only possible on the driver, not on the workers
    pass
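One way around this, as a sketch rather than a drop-in replacement (it simplifies the original logic: each row is compared with its immediate predecessor instead of the last flagged row, and months_between stands in for the year/month arithmetic in the loop), is to compute the flag with built-in window functions so no SparkContext is needed inside a UDF:

import pyspark.sql.functions as F
from pyspark.sql.window import Window

# window over each user's rows in date order
w = Window.partitionBy("id").orderBy("Date")

sample_new = (
    sample
    .withColumn("prev_date", F.lag("Date").over(w))
    .withColumn(
        "flag",
        F.when(F.col("prev_date").isNull(), F.lit(1))   # first row per id
         .when(F.months_between(F.col("Date"), F.col("prev_date")) >= 2, F.lit(1))
         .otherwise(F.lit(0)),
    )
    .drop("prev_date")
)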

Is it possible to pass an extra argument to lambda function in pandas read_csv

I am using the read_csv() function from pandas, quite often with a lambda as the date_parser function, and I am wondering if it is possible to pass an extra argument to this lambda function.
This is a minimal example where I set the format_string:
import pandas as pd
def date_parser_1(value, format_string='%Y.%m.%d %H:%M:%S'):
    return pd.to_datetime(value, format=format_string)

df = pd.read_csv(file,
                 parse_dates=[1],
                 date_parser=date_parser_1  # args('%Y-%m-%d %H:%M:%S')
                 )
print(df)
I do know that pandas has an infer_datetime_format flag, but this question is only about a self-defined date_parser.
Welcome to the magic of partial functions.
def outer(outer_arg):
    def inner(inner_arg):
        return outer_arg * inner_arg
    return inner

fn = outer(5)
print(fn(3))
Basically you define your function inside a function and return that inner function as the result. In this case I call outer(5), which means I now have a function assigned to fn that I can call many times; each call executes the inner function, but with outer_arg held in the closure.
So in your case:
def dp1_wrapper(format_string):
    def date_parser_1(value):
        return pd.to_datetime(value, format=format_string)
    return date_parser_1

df = pd.read_csv(file,
                 parse_dates=[1],
                 date_parser=dp1_wrapper('%Y.%m.%d %H:%M:%S')
                 )
Once you know how this works, there is a shortcut utility:
from functools import partial

df = pd.read_csv(file,
                 parse_dates=[1],
                 date_parser=partial(date_parser_1, format_string='%Y.%m.%d %H:%M:%S')
                 )
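To see what partial does on its own, a quick standalone check (the sample timestamp string is made up):

from functools import partial

# the partial object behaves like date_parser_1 with format_string pre-filled
parser = partial(date_parser_1, format_string='%Y.%m.%d %H:%M:%S')
print(parser('2021.03.01 12:00:00'))  # Timestamp('2021-03-01 12:00:00')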

create PySpark Dataframe column based on class method

I have a Python class with methods like the ones below:
class Features():
    def __init__(self, json):
        self.json = json

    def get_email(self):
        email = self.json.get('fields', {}).get('email', None)
        return email
And I am trying to use the get_email function in a PySpark DataFrame to create a new column based on another column, "raw_json", which contains a JSON value:
df = data.withColumn('email', (F.udf(lambda j: Features.get_email(json.loads(j)), t.StringType()))('raw_json'))
So the ideal pyspark dataframe looks like below:
+----------------+-----+
|raw_json        |email|
+----------------+-----+
|                |     |
|                |     |
+----------------+-----+
But I am getting an error saying:
TypeError: unbound method get_email() must be called with Features instance as first argument (got dict instance instead)
What should I do to achieve this?
I have seen a similar question asked before but it was not resolved.
I guess you have misunderstood how classes are used in Python. You're probably looking for this instead:
udf = F.udf(lambda j: Features(json.loads(j)).get_email())
df = data.withColumn('email', udf('raw_json'))
where you instantiate a Features object and call the get_email method of the object.
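Put together, a minimal end-to-end sketch (the sample JSON string is made up, and spark is assumed to be an existing SparkSession):

import json
import pyspark.sql.functions as F
from pyspark.sql.types import StringType

# wrap the instance-method call in a UDF with an explicit string return type
get_email_udf = F.udf(lambda j: Features(json.loads(j)).get_email(), StringType())

data = spark.createDataFrame(
    [('{"fields": {"email": "a@example.com"}}',)], ["raw_json"])
data.withColumn("email", get_email_udf("raw_json")).show(truncate=False)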

pyspark udf print row being analyzed

I have a problem inside a pyspark udf function and I want to print the number of the row generating the problem.
I tried to count the rows using the equivalent of a "static variable" in Python, so that when the udf is called with a new row, a counter is incremented. However, it is not working:
import pyspark.sql.functions as F

def myF(input):
    myF.lineNumber += 1
    if (somethingBad):
        print(myF.lineNumber)
    return res

myF.lineNumber = 0

myF_udf = F.udf(myF, StringType())
How can I count the number of times a udf is called in order to find the number of the row generating the problem in pyspark?
UDFs are executed at workers, so the print statements inside them won't show up in the output (which is from the driver). The best way to handle issues with UDFs is to change the return type of the UDF to a struct or a list and pass the error information along with the returned output. In the code below I am just adding the error info to the string res that you were returning originally.
import pyspark.sql.functions as F

def myF(input):
    myF.lineNumber += 1
    if (somethingBad):
        res += 'Error in line {}'.format(myF.lineNumber)
    return res

myF.lineNumber = 0

myF_udf = F.udf(myF, StringType())
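As a follow-up sketch (process is a placeholder for whatever the real UDF computes): each worker process gets its own copy of myF, so the counter does not give a global row number; embedding the offending input value in the returned string is usually more informative.

import pyspark.sql.functions as F
from pyspark.sql.types import StringType

def myF(value):
    try:
        return process(value)   # placeholder for the real per-row logic
    except Exception as e:
        # return the error together with the input that caused it
        return 'Error for input {!r}: {}'.format(value, e)

myF_udf = F.udf(myF, StringType())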
