Unable to use Pandas UDF in Databricks - python

I have to run a script that takes a few arguments as input and returns some results as output, so I first developed it on my local machine, where it works fine, and my goal now is to run it in Databricks in order to parallelize it.
The issue comes when I try to parallelize it. I'm taking the data from an already mounted Data Lake (the issue is not there, as I'm able to print the DataFrame after reading it), converting it to a Spark DataFrame, and passing the rows to the main function grouped by material:
import pandas as pd
import os
import numpy as np
import scipy.stats as stats
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType

# Pandas udf
schema = StructType([StructField('Material', IntegerType(), True),
                     StructField('Alpha', IntegerType(), True),
                     StructField('Beta', IntegerType(), True),
                     StructField('Sales', IntegerType(), True),
                     StructField('SL', FloatType(), True)])
@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def main(data):
    material = data['Material'].iloc[0]
    print(material)       # <-------- THIS IS NOT PRINTING
    print('Hello world')  # <------ NEITHER IS THIS
    start = data['start '].iloc[0]
    end = data['end '].iloc[0]
    mu_lt = data['mu_lt'].iloc[0]
    sigma_lt = data['sigma_lt'].iloc[0]
    df = pd.DataFrame(columns=('Material', 'Alpha', 'Beta', 'Sales', 'SL'))
    for beta in range(1, 2):
        for alpha in range(3, 5):
            # Do stuff
            pass
    return df
if __name__ == '__main__':
    spark = SparkSession.builder.getOrCreate()
    params = pd.read_csv('/dbfs/mnt/input/params_input.csv')
    params_spark = spark.createDataFrame(params)
    params_spark.groupby('Material').apply(main).show()
I'm not sure whether I'm passing the DataFrame to the main function correctly or even declaring the UDF right, but neither the prints nor the DataFrame defined inside main seem to be produced. The code throws no error, but no output is returned either.

Try this:
@pandas_udf('y int, ds int, store_id string, product_id string, log string', PandasUDFType.GROUPED_MAP)
def train_predict(pdf):
    return pd.DataFrame([[3, 5, 'store123', 'product123', 'My log message']],
                        columns=['y', 'ds', 'store_id', 'product_id', 'log'])
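For reference, here is a minimal sketch (assuming Spark 2.4+ with Arrow enabled, and the params_spark DataFrame from the question) of how a grouped-map pandas UDF like this gets applied. Note that the UDF body runs on the executors, so any print() inside it shows up in the executor logs, not in the notebook cell:
result = params_spark.groupby('Material').apply(train_predict)
result.show()  # show() is the action that actually triggers the UDF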

Related

Issue in executing pyspark udf for simple time series forecasting

I'm writing a pyspark (2.4.8) job for performing time series predictions with a lot of models. I have a pandas df containing the model_ids and the corresponding paths where the pickle files are stored. I need to generate predictions using all these models.
My approach is to first convert the pandas df to a Spark df. Then I plan to execute a UDF per model-artifact-path to get the predictions, which are a simple list of integers.
Below is my high-level code:
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, ArrayType

def inference_arima_with_udf(spark_session, model_metadata_pandas_df, start_date, end_date):
    def get_preds(model_path, num_steps):
        # load model from model_metadata
        model = load_model(model_path)
        predictions = model.predict(num_steps)  # returns a numpy array of ints
        return predictions

    def get_schema():
        schema = StructType(
            [
                StructField("model_id", IntegerType(), True),
                StructField("model_artifacts_path", StringType(), True)
            ]
        )
        return schema

    num_steps = end_date - start_date + 1
    model_metadata_spark_df = spark_session.createDataFrame(model_metadata_pandas_df, get_schema())
    get_preds_udf = F.udf(lambda mp: get_preds(mp, num_steps), ArrayType(IntegerType()))
    model_metadata_spark_df = model_metadata_spark_df.withColumn('forecasts',
                                                                 get_preds_udf(F.col('model_artifacts_path'),
                                                                               F.lit(num_steps)))
    return model_metadata_spark_df
The problem is, once I execute this function, it returns this error:
TypeError: <lambda>() takes 1 positional argument but 2 were given
I've spent quite some time on this, but no luck so far. Any help to point me in the right direction is much appreciated.
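As a sketch of a likely cause (based only on the code shown above): the lambda wrapped by F.udf accepts a single argument, but the UDF is invoked with two columns (the path and the literal num_steps), which is exactly the arity the TypeError complains about. Accepting both arguments, or alternatively passing only the path column, should resolve the mismatch:
get_preds_udf = F.udf(lambda mp, n: get_preds(mp, n), ArrayType(IntegerType()))
model_metadata_spark_df = model_metadata_spark_df.withColumn(
    'forecasts',
    get_preds_udf(F.col('model_artifacts_path'), F.lit(num_steps)))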

pyspark UDF returns AttributeError: 'DataFrame' object has no attribute 'sort_values'

I'm having a hard time with my program: I'm trying to apply a UDF to a DataFrame and I keep getting the error message in my title. Here is my code:
import pandas as pd
import datetime as dt
import numpy as np
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
df = pd.DataFrame({
    'ID': [1, 2, 2],
    'dt': [pd.Timestamp.now(), pd.Timestamp.now(),
           pd.Timestamp.now()]})
df.head()

def FlagUsers(df, ids, tm, gap):
    df = df.sort_values([ids, tm])
    df[ids] = df[ids].astype(str)
    df['timediff'] = df.groupby(ids)[tm].diff()
    df['prevtime'] = df.groupby(ids)[tm].shift()
    df['prevuser'] = df[ids].shift()
    df['prevuser'].fillna(0, inplace=True)
    df['timediff'] = df.timediff / pd.Timedelta('1 minute')
    df['timediff'].fillna(99, inplace=True)
    df['flagnew'] = np.where((df.timediff < gap) & (df['prevuser'] == df[ids]), 'existing', 'new')
    df.loc[df.flagnew == 'new', 'sessnum'] = df.groupby([ids, 'flagnew']).cumcount() + 1
    df['sessnum'] = df['sessnum'].fillna(method='ffill')
    df['session_key'] = df[ids].astype(str) + "_" + df['sessnum'].astype(str)
    df.drop(['prevtime', 'prevuser'], axis=1, inplace=True)
    arr = df['session_key'].values
    return arr

# Python Function works fine:
FlagUsers(df, 'ID', 'dt', 5)

s_df = spark.createDataFrame(df)
s_df.show()
spark.udf.register("FlagUsers", FlagUsers)
s_df = s_df.withColumn('session_key', FlagUsers(s_df, 'ID', 'dt', 5))
My function works fine in plain Python, but when I try to run it in Spark it does not work. I'm really sorry if this is a silly question! Thank you & best wishes.
A pyspark UDF is not the same as a native Python function; it has specific requirements (it is applied to columns and must declare a return type, so you can't just pass it a whole Spark DataFrame like this).
Please experiment with a pandas UDF instead; it is several times faster and better suited to this kind of per-group logic:
https://docs.databricks.com/spark/latest/spark-sql/udf-python-pandas.html
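As a rough sketch of that suggestion (assuming Spark 2.4+ with Arrow enabled, and reusing the FlagUsers pandas function defined in the question), a grouped-map pandas UDF lets each ID group be processed as an ordinary pandas DataFrame:
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf("ID long, dt timestamp, session_key string", PandasUDFType.GROUPED_MAP)
def flag_users_udf(pdf):
    pdf = pdf.sort_values(['ID', 'dt'])                 # same ordering FlagUsers applies internally
    pdf['session_key'] = FlagUsers(pdf, 'ID', 'dt', 5)  # gap of 5 minutes, as in the question
    return pdf[['ID', 'dt', 'session_key']]

s_df.groupby('ID').apply(flag_users_udf).show()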

how can I assign a row with Pyspark Dataframe?

Can you please convert the expression below from Pandas to a Pyspark DataFrame? I am trying to find the equivalent of loc in Pyspark:
import pandas as pd
df3 = pd.DataFrame(columns=["Devices","months"])
new_entry = {'Devices': 'device1', 'months': 'month1'}
df3.loc[len(df3)] = new_entry
In Pyspark you need to use union to add a new row to an existing data frame. But Spark data frames are unordered and there is no index as in pandas, so there is no exact equivalent. For the example you gave, you would write it like this in Pyspark:
from pyspark.sql.types import *

schema = StructType([
    StructField('Devices', StringType(), True),
    StructField('months', StringType(), True)
])
df = spark.createDataFrame(sc.emptyRDD(), schema)
new_row_df = spark.createDataFrame([("device1", "month1")], ["Devices", "months"])
# or using spark.sql()
# new_row_df = spark.sql("select 'device1' as Devices, 'month1' as months")
df = df.union(new_row_df)
df.show()
#+-------+------+
#|Devices|months|
#+-------+------+
#|device1|month1|
#+-------+------+
If you want to add a row at a "specific position", you can create a row index using, for example, the row_number function over a defined ordering, then filter out the row number you want to assign the new row to before doing the union:
from pyspark.sql import functions as F
from pyspark.sql import Window
df = df.withColumn("rn", F.row_number().over(Window.orderBy("Devices")))
# df.loc[1] = ...
df = df.filter("rn <> 1").drop("rn").union(new_row_df)

Getting an error"TypeError: cannot unpack non-iterable float object" after excuting script in ApacheSpark. can anyone please debug my code?

Hi, I am following Frank Kane's course on Apache Spark with Python. Here I am trying to calculate the total amount spent by different customers. I have mentioned the error below. Kindly help. The following is my code:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("MaxTemperatures")
sc = SparkContext(conf=conf)

def parseline(lines):
    fields = lines.split(',')
    customerId = int(fields[0])
    dollars = float(fields[2])
    return (customerId, dollars)

text = sc.textFile("file:///Sparkcourse/SparkCourse/customer-orders.csv")
rdd = text.map(parseline)
reduction = rdd.map(lambda x: x[1]).reduceByKey(lambda x, y: x + y)
sortedvalues = reduction.sortByKey()
final = sortedvalues.collect()
for i, j in final:
    print(i, j)
TypeError: cannot unpack non-iterable float object
I am not sure about exactly what you want to do, but I have the intuition that you have at least one error in your code: the rdd.map(lambda x: x[1]) step throws away the customerId keys, so reduceByKey no longer receives (key, value) pairs, which is what the "cannot unpack non-iterable float object" error is complaining about. You should apply reduceByKey directly to the (customerId, dollars) pairs, the following way:
reduceByKey(lambda x, y: x + y)
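As a quick sketch, keeping the rest of your code as-is, that change would look like this:
reduction = rdd.reduceByKey(lambda x, y: x + y)  # sums dollars per customerId
sortedvalues = reduction.sortByKey()
for customer_id, total in sortedvalues.collect():
    print(customer_id, total)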
I also think that, in your case, every step you want to perform can be translated into the DataFrame API, which will be easier to use. RDDs are not the easiest structure to handle for simple operations such as sums (and DataFrames will be faster).
So I can propose something like the following. You will probably have to change the schema statement to match your CSV structure. Assuming your Spark session is named spark:
import pyspark.sql.types as pst
import pyspark.sql.functions as psf

schema = pst.StructType([
    pst.StructField("customerId", pst.IntegerType(), True),
    pst.StructField("dollars", pst.DoubleType(), True),
    pst.StructField("productid", pst.IntegerType(), True)])

(spark.read
    .csv("file:///Sparkcourse/SparkCourse/customer-orders.csv", header=False, schema=schema)
    .groupBy('customerId')
    .agg(psf.sum("dollars").alias("dollars"))
    .orderBy('dollars')
)

pyspark dataframe "condition should be string or Column"

I am unable to use a filter on a data frame. I keep getting the error TypeError("condition should be string or Column").
I have tried changing the filter to use a col object. Still, it does not work.
path = 'dbfs:/FileStore/tables/TravelData.txt'
data = spark.read.text(path)
from pyspark.sql.types import StructType, StructField, IntegerType , StringType, DoubleType
schema = StructType([
    StructField("fromLocation", StringType(), True),
    StructField("toLocation", StringType(), True),
    StructField("productType", IntegerType(), True)
])
df = spark.read.option("delimiter", "\t").csv(path, header=False, schema=schema)
from pyspark.sql.functions import col
answerthree = df.select("toLocation").groupBy("toLocation").count().sort("count", ascending=False).take(10) # works fine
display(answerthree)
I add a filter to variable "answerthree" as follows:
answerthree = df.select("toLocation").groupBy("toLocation").count().filter(col("productType")==1).sort("count", ascending=False).take(10)
It throws errors as follows:
"cannot resolve 'productType' given input columns" and "condition should be string or Column"
In gist, I am trying to solve problem 3 given in the link below using pyspark instead of Scala. The dataset is also provided in the URL below.
https://acadgild.com/blog/spark-use-case-travel-data-analysis?fbclid=IwAR0fgLr-8aHVBsSO_yWNzeyh7CoiGraFEGddahDmDixic6wmumFwUlLgQ2c
I should be able to get the desired result only for productType value 1.
The filter has to be applied before the select/groupBy, because after those steps productType is no longer among the columns (which is what the "cannot resolve 'productType'" part of the error is telling you). As you don't have a variable referencing the intermediate data frame, the easiest is to use a string condition:
answerthree = df.filter("productType = 1")\
    .select("toLocation").groupBy("toLocation").count()\
    .sort(...
Alternatively, you can keep a data frame variable and use a column-based filter:
filtered_df = df.filter(df['productType'] == 1)
answerthree = filtered_df.select("toLocation").groupBy("toLocation").count()\
    .sort("count", ascending=False).take(10)
