I came across the following lambda-based UDF line in PySpark while browsing a long Python Jupyter notebook, and I am trying to understand it. Can you explain what it does?
parse = udf (lambda x: (datetime.datetime.utcnow() - timedelta(hours= x)).isoformat()[:-3] + 'Z', StringType())
udf(
    lambda x: (datetime.datetime.utcnow() - timedelta(hours=x)).isoformat()[:-3] + 'Z',
    StringType()
)
udf in PySpark wraps a Python function so that Spark runs it for every row of a DataFrame. From the PySpark documentation:
Creates a user defined function (UDF).
New in version 1.3.0.
Parameters:
f : function
python function if used as a standalone function
returnType : pyspark.sql.types.DataType or str
the return type of the user-defined function. The value can be either a pyspark.sql.types.DataType object or a DDL-formatted type string.
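Because returnType accepts either a DataType object or a DDL-formatted string, the StringType() in the original line could equally be written as the string 'string'. A minimal sketch (the lambdas here are just placeholders):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# two equivalent ways to declare that the UDF returns a string
parse_a = udf(lambda x: str(x), StringType())
parse_b = udf(lambda x: str(x), "string")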
The returnType here is StringType(), so the UDF returns a string. Removing it, we get the function body we're interested in:
lambda x: (datetime.datetime.utcnow() - timedelta(hours=x)).isoformat()[:-3] + 'Z'
To find out what the given lambda function does, you can turn it into a regular function. You may need to add the imports too.
import datetime
from datetime import timedelta

def func(x):
    return (datetime.datetime.utcnow() - timedelta(hours=x)).isoformat()[:-3] + 'Z'
To really see what's going on, you can assign every step to a variable and print them.
import datetime
from datetime import timedelta

def my_func(x):
    v1 = datetime.datetime.utcnow()   # current UTC time
    v2 = timedelta(hours=x)           # offset of x hours
    v3 = v1 - v2                      # UTC time x hours ago
    v4 = v3.isoformat()               # ISO 8601 string
    v5 = v4[:-3]                      # drop the last three microsecond digits
    v6 = v5 + 'Z'                     # append 'Z' to mark UTC ("Zulu") time
    for e in (v1, v2, v3, v4, v5):
        print(e)
    return v6

print(my_func(3))
# 2022-06-17 07:16:36.212566
# 3:00:00
# 2022-06-17 04:16:36.212566
# 2022-06-17T04:16:36.212566
# 2022-06-17T04:16:36.212
# 2022-06-17T04:16:36.212Z
This way you can see how the result changes at every step, and you can print whatever you want at any step you need, e.g. print(type(v4)).
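Putting it back into a Spark context, the UDF is typically applied with withColumn. A minimal sketch, assuming a hypothetical DataFrame with an integer column hours_ago:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
import datetime
from datetime import timedelta

spark = SparkSession.builder.getOrCreate()

parse = udf(lambda x: (datetime.datetime.utcnow() - timedelta(hours=x)).isoformat()[:-3] + 'Z', StringType())

# hypothetical data: number of hours to subtract from the current UTC time
df = spark.createDataFrame([(1,), (24,), (48,)], ["hours_ago"])

# the UDF is evaluated for every row; the new column holds ISO 8601 strings ending in 'Z'
df.withColumn("utc_timestamp", parse(df.hours_ago)).show(truncate=False)

Each value in hours_ago is passed to the lambda as x, and the new column holds the corresponding UTC timestamp string ending in 'Z'.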
I'm still at the learning stage of Python. In the following example (taken from Method 3 of this article), the user defined function (UDF) is named Total(...,...), but the author calls it with the name new_f(...,...).
Question: In the code below, how do we know that calling new_f(...,...) should call the function Total(...,...)? What if there were another UDF, say Sum(...,...)? In that case, how would the code know whether calling new_f(...,...) means calling Total(...,...) or Sum(...,...)?
# import the functions as F from pyspark.sql
import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType
# define the sum_col
def Total(Course_Fees, Discount):
    res = Course_Fees - Discount
    return res
# integer datatype is defined
new_f = F.udf(Total, IntegerType())
# calling and creating the new
# col as udf_method_sum
new_df = df.withColumn(
    "Total_price", new_f("Course_Fees", "Discount"))
# Showing the Dataframe
new_df.show()
new_f = F.udf(Total, IntegerType())
assigns the name new_f to the UDF that F.udf builds from the plain Python function Total. There is no hidden link between the names: new_f simply refers to whatever function object was passed to F.udf. If there were another function such as Sum, you would wrap it in its own UDF and bind it to a different name, as sketched below.
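A minimal sketch, assuming a hypothetical Sum function alongside Total and the same df as in the question:

import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType

def Total(Course_Fees, Discount):
    return Course_Fees - Discount

# hypothetical second function with the same signature
def Sum(Course_Fees, Discount):
    return Course_Fees + Discount

# each plain Python function is wrapped in its own UDF;
# the variable on the left is the name you later call on columns
total_udf = F.udf(Total, IntegerType())
sum_udf = F.udf(Sum, IntegerType())

new_df = (df
          .withColumn("Total_price", total_udf("Course_Fees", "Discount"))
          .withColumn("Fees_plus_Discount", sum_udf("Course_Fees", "Discount")))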
My situation is that I'm receiving transaction data from a vendor; it has a datetime that is in local time but does not carry the correct offset. For example, the ModifiedDate column may have a value of
'2020-05-16T15:04:55.7429192+00:00'
I can get the local timezone by pulling some other data together about the store in which the transaction occurs
timezone_local = tz.timezone(tzDf[0]["COUNTRY"] + '/' + tzDf[0]["TIMEZONE"])
I then wrote a function to take those two values and give it the proper timezone:
from datetime import datetime
import dateutil.parser as parser
import pytz as tz

def convert_naive_to_aware(datetime_local_str, timezone_local):
    # parse the string once and pull out the individual fields
    parsed = parser.parse(datetime_local_str)
    yy = parsed.year
    mo = parsed.month    # month and minute need distinct names
    dd = parsed.day
    hh = parsed.hour
    mi = parsed.minute
    ss = parsed.second
    # ms = parsed.microsecond
    # print('yy:' + str(yy) + ', mo:' + str(mo) + ', dd:' + str(dd) + ', hh:' + str(hh) + ', mi:' + str(mi) + ', ss:' + str(ss))
    aware = datetime(yy, mo, dd, hh, mi, ss, 0, timezone_local)
    return aware
It works fine when I send it the timestamp as a string in testing, but it balks when I try to apply it to a dataframe, I presume because I don't yet know the right way to pass the column value as a string. In this case, I'm trying to replace the current ModifiedTime value with the result of the call to the function.
from pyspark.sql import functions as F
.
.
.
ordersDf = ordersDf.withColumn("ModifiedTime", convert_naive_to_aware(F.substring(ordersDf.ModifiedTime, 1, 19), timezone_local))
Those of you more knowledgeable than I won't be surprised that I received the following error:
TypeError: 'Column' object is not callable
I admit, I'm a bit of a tyro at Python and dataframes, and I may well be taking the long way 'round. I've attempted a few other things such as ordersDf.ModifiedTime.cast("String"), etc., but no luck. I'd be grateful for any suggestions.
We're using Azure Databricks, the cluster is Scala 2.11.
You need to convert the function into a UDF before you can apply it on a Spark dataframe:
from pyspark.sql import functions as F
# I assume `tzDf` is a pandas dataframe... This syntax wouldn't work with spark.
timezone_local = tz.timezone(tzDf[0]["COUNTRY"] + '/' + tzDf[0]["TIMEZONE"])
# Convert function to UDF
time_udf = F.udf(convert_naive_to_aware)
# Avoid overwriting dataframe variables. Here I appended `2` to the new variable name.
ordersDf2 = ordersDf.withColumn(
    "ModifiedTime",
    # call the UDF wrapper, not the plain Python function
    time_udf(F.substring(ordersDf.ModifiedTime, 1, 19), F.lit(str(timezone_local)))
)
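Note that the timezone now reaches the UDF as a plain string (a Spark column cannot hold a tzinfo object), so a variant of the function would need to rebuild the tzinfo inside it. A minimal sketch, assuming pytz is imported as tz as in the question; the TimestampType return type is optional but avoids getting the result back as a string:

from datetime import datetime
import dateutil.parser as parser
import pytz as tz
from pyspark.sql import functions as F
from pyspark.sql.types import TimestampType

def convert_naive_to_aware_str(datetime_local_str, timezone_name):
    # timezone_name arrives as a string such as 'Europe/London'
    timezone_local = tz.timezone(timezone_name)
    parsed = parser.parse(datetime_local_str)
    # drop any offset and microseconds from the parsed value, then localize it
    return timezone_local.localize(parsed.replace(tzinfo=None, microsecond=0))

time_udf = F.udf(convert_naive_to_aware_str, TimestampType())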
I have a properly working function and would like to name the resulting DataFrames while looping.
There is a function:
def function1(link):
    ...

v0 = (x, y, z)
v1 = (aa, bb, cc)

for link, name in zip(v0, v1):
    df = function1(link)
It seems there is an issue as I cannot pass the variable from the loop to the data frame name.
The result I want to achieve:
df.aa from function1(x)
df.bb from function1(y)
df.cc from function1(z)
If I understand correctly, you want to use a dictionary to store the named results of the function calls:
def foo(x):
    return some_dataframe

v0 = (x, y, z)
v1 = ('aa', 'bb', 'cc')

data = dict()
for v, name in zip(v0, v1):
    data[name] = foo(v)
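Each result can then be looked up by its name; a short usage note:

# access each stored dataframe by name
df_aa = data['aa']   # result of foo(x)
df_bb = data['bb']   # result of foo(y)
df_cc = data['cc']   # result of foo(z)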
from pyspark.sql import functions as func
I have a Pyspark Dataframe, which is called df. It has the following schema:
id: string
item: string
data: double
I apply on it the following operation:
grouped_df = df.groupBy(["id", "item"]).agg(func.collect_list(df.data).alias("dataList"))
Also, I defined the user-defined function iqrOnList:
@udf
def iqrOnList(accumulatorsList: list):
    import numpy as np
    Q1 = np.percentile(accumulatorsList, 25)
    Q3 = np.percentile(accumulatorsList, 75)
    IQR = Q3 - Q1
    lowerFence = Q1 - (1.5 * IQR)
    upperFence = Q3 + (1.5 * IQR)
    return [elem if (elem >= lowerFence and elem <= upperFence) else None for elem in accumulatorsList]
I used this UDF in this way:
grouped_df = grouped_df.withColumn("SecondList", iqrOnList(grouped_df.dataList))
These operations return the dataframe grouped_df, which looks like this:
id: string
item: string
dataList: array
SecondList: string
Problem:
SecondList has exactly the correct values I expect (for example [1, 2, 3, null, 3, null, 2]), but with the wrong return type (string instead of array, even though it keeps the form of an array).
The problem is I need it to be stored as an array, exactly as dataList is.
Questions:
1) How can I save it with the correct type?
2) This UDF is expensive in terms of performance.
I read here that a Pandas UDF performs much better than a common UDF. What is the equivalent of this method as a Pandas UDF?
Bonus question (lower priority): func.collect_list(df.data) does not collect null values, which df.data has. I'd like to collect them too; how can I do that without replacing all null values with another default value?
You can keep your current syntax; you just need to provide the return type in the decorator:
import pyspark.sql.types as Types
@udf(returnType=Types.ArrayType(Types.DoubleType()))
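For the second question, a Pandas UDF version could look roughly like the sketch below, assuming Spark 3.x with PyArrow available; the name iqr_on_list is hypothetical and simply mirrors the logic of iqrOnList:

import numpy as np
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType

@F.pandas_udf(ArrayType(DoubleType()))
def iqr_on_list(data: pd.Series) -> pd.Series:
    # each element of `data` is one list produced by collect_list
    def filter_outliers(values):
        arr = np.asarray(values, dtype=float)
        q1, q3 = np.percentile(arr, [25, 75])
        iqr = q3 - q1
        lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        return [float(v) if lower <= v <= upper else None for v in values]
    return data.apply(filter_outliers)

grouped_df = grouped_df.withColumn("SecondList", iqr_on_list("dataList"))

A Pandas UDF processes a whole batch of rows at once as pandas Series, which is what usually makes it faster than a row-at-a-time UDF.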
FutureWarning:
Panel is deprecated and will be removed in a future version.
The recommended way to represent these types of 3-dimensional data are with a MultiIndex on a DataFrame, via the Panel.to_frame() method.
I am getting the above warning whenever I run this code:
difference = pd.Panel(dict(df1=df1,df2=df2))
Can anyone please tell me an alternative to using Panel in the above line of code?
Edit 1:
def report_diff(x):
    return x[0] if x[0] == x[1] else '{} ---> {}'.format(*x)

difference = pd.Panel(dict(df1=df1, df2=df2))
res = difference.apply(report_diff, axis=0)
Here df1 and df2 contain both categorical and numerical data. I'm just comparing the two dataframes to get the differences between them.
As stated in the docs, the recommended replacements for a pandas Panel are a MultiIndex on a DataFrame, or the xarray library.
For your specific use case, this somewhat hacky code gets you the same result:
import numpy as np
import pandas as pd

# flatten both frames into 1-D arrays so they can be compared element by element
a = df1.values.reshape(df1.shape[0] * df1.shape[1])
b = df2.values.reshape(df2.shape[0] * df2.shape[1])

# keep the value where the frames agree, otherwise record "old--->new"
res = np.array([v if v == b[idx] else str(v) + '--->' + str(b[idx]) for idx, v in enumerate(a)]).reshape(
    df1.shape[0], df1.shape[1])
res = pd.DataFrame(res, columns=df1.columns)
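An alternative sketch that stays within pandas and preserves the index and the original values where the frames agree (assuming df1 and df2 share the same shape, index, and columns):

# keep the original value where the frames agree, otherwise show "old ---> new"
mask = df1 == df2
res = df1.where(mask, df1.astype(str) + ' ---> ' + df2.astype(str))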