I'm writing a PySpark (2.4.8) job that performs time series predictions with a large number of models. I have a pandas df containing the model_ids and the corresponding paths where the pickle files are stored. I need to generate predictions using all of these models.
My approach is to first convert the pandas df to a Spark df. Then I plan to execute a udf per model-artifact-path to get the predictions, which are a simple list of integers.
Below is my high-level code:
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType

def inference_arima_with_udf(spark_session, model_metadata_pandas_df, start_date, end_date):

    def get_preds(model_path, num_steps):
        # load model from model_metadata
        model = load_model(model_path)
        predictions = model.predict(num_steps)  # returns a numpy array of ints
        return predictions

    def get_schema():
        schema = StructType(
            [
                StructField("model_id", IntegerType(), True),
                StructField("model_artifacts_path", StringType(), True)
            ]
        )
        return schema

    num_steps = end_date - start_date + 1
    model_metadata_spark_df = spark_session.createDataFrame(model_metadata_pandas_df, get_schema())
    get_preds_udf = F.udf(lambda mp: get_preds(mp, num_steps), ArrayType(IntegerType()))
    model_metadata_spark_df = model_metadata_spark_df.withColumn('forecasts',
                                                                 get_preds_udf(F.col('model_artifacts_path'),
                                                                               F.lit(num_steps)))
    return model_metadata_spark_df
The problem is that once I execute this function, it returns this error:
TypeError: <lambda>() takes 1 positional argument but 2 were given
I've spent quite some time on this, but no luck so far. Any help to point me in the right direction is much appreciated.
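Update: the error message points at the lambda itself. It accepts a single argument (mp), but the udf is invoked with two columns, model_artifacts_path and F.lit(num_steps). A minimal sketch of one possible fix, assuming the rest of the function stays as above, is to register get_preds directly so the udf takes both arguments:

# register get_preds itself, so the udf accepts (model_path, num_steps)
get_preds_udf = F.udf(get_preds, ArrayType(IntegerType()))

model_metadata_spark_df = model_metadata_spark_df.withColumn(
    'forecasts',
    get_preds_udf(F.col('model_artifacts_path'), F.lit(num_steps))
)

Alternatively, keep the one-argument lambda and call get_preds_udf(F.col('model_artifacts_path')) without the extra F.lit(num_steps), since num_steps is already captured by the closure. Note also that get_preds returns a numpy array; you may need predictions.tolist() so the result matches ArrayType(IntegerType()).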
I'm using auto_arima via pmdarima to fit multiple time series via a groupby. That is to say, I have a pd.DataFrame of stacked time-indexed data, grouped by a column named variable, and have successfully applied transform(pm.auto_arima) to each group. The reproducible example finds boring best ARIMA models, but the idea seems to work. I now want to apply .predict() in the same way, but cannot get it to play nicely with apply / lambda(x) / their combinations.
The code below works until the # Forecasting - help! section. I'm having trouble catching the correct object (apparently) in the apply. How might I adapt one of test1, test2, or test3 to get what I want? Or, is there some other best-practice construct to consider? Is it better across columns (without a melt)? Or via a loop?
Ultimately, I hope that test1, say, is a stacked pd.DataFrame (or pd.Series at least) with 8 rows: 4 forecasted values for each of the 2 time series in this example, with an identifier column variable (possibly tacked on after the fact).
import pandas as pd
import pmdarima as pm
import itertools
# Get data - this is OK.
url = 'https://raw.githubusercontent.com/nickdcox/learn-airline-delays/main/delays_2018.csv'
keep = ['arr_flights', 'arr_cancelled']
# Setup data - this is OK.
df = pd.read_csv(url, index_col=0)
df.index = pd.to_datetime(df.index, format = "%Y-%m")
df = df[keep]
df = df.sort_index()
df = df.loc['2018']
df = df.groupby(df.index).sum()
df.reset_index(inplace = True)
df = df.melt(id_vars = 'date', value_vars = df.columns.to_list()[1:])
# Fit auto.arima for each time series - this is OK.
fit = df.groupby('variable')['value'].transform(pm.auto_arima).drop_duplicates()
fit = fit.to_frame(name = 'model')
fit['variable'] = keep
fit.reset_index(drop = True, inplace = True)
# Setup forecasts - this is OK.
max_date = df.date.max()
dr = pd.to_datetime(pd.date_range(max_date, periods = 4 + 1, freq = 'MS').tolist()[1:])
yhat = pd.DataFrame(list(itertools.product(keep, dr)), columns = ['variable', 'date'])
yhat.set_index('date', inplace = True)
# Forecasting - help! - Can't get any of these to work.
def predict_fn(obj):
    return obj.loc[0].predict(4)
predict_fn(fit.loc[fit['variable'] == 'arr_flights']['model']) # Appears to work!
test1 = fit.groupby('variable')['model'].apply(lambda x: x.predict(n_periods = 4)) # Try 1: 'Series' object has no attribute 'predict'.
test2 = fit.groupby('variable')['model'].apply(lambda x: x.loc[0].predict(n_periods = 4)) # Try 2: KeyError
test3 = fit.groupby('variable')['model'].apply(predict_fn) # Try 3: KeyError
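For what it's worth, test2 seems to fail with a KeyError because after reset_index(drop=True) only the first group's row carries the label 0, so .loc[0] cannot work for the other group; a positional lookup avoids that. A rough sketch along those lines, assuming the fit frame built above and 4 forecast periods:

# s is the one-row Series of models for this group; s.iloc[0] is the fitted pmdarima model
test1 = fit.groupby('variable')['model'].apply(
    lambda s: pd.Series(s.iloc[0].predict(n_periods=4))
)
# test1 is a Series with a (variable, step) MultiIndex and 8 values;
# test1.reset_index() gives the stacked frame with a 'variable' column tacked on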
I have to run a script that takes a few arguments as input and returns some results as output. I first developed it on my local machine, where it works fine, and my goal now is to run it in Databricks in order to parallelize it.
The issue comes when I try to parallelize it. I'm reading the data from an already-mounted Data Lake (the issue is not there, as I'm able to print the DataFrame after reading it), transforming it into a Spark DataFrame and passing each group of rows, grouped by Material, to the main function:
import pandas as pd
import os
import numpy as np
import scipy.stats as stats
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, IntegerType, FloatType

# Pandas udf
schema = StructType([StructField('Material', IntegerType(), True),
                     StructField('Alpha', IntegerType(), True),
                     StructField('Beta', IntegerType(), True),
                     StructField('Sales', IntegerType(), True),
                     StructField('SL', FloatType(), True)])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def main(data):
    material = data['Material'].iloc[0]
    print(material)       # <-------- THIS IS NOT PRINTING
    print('Hello world')  # <------ NEITHER IS THIS
    start = data['start '].iloc[0]
    end = data['end '].iloc[0]
    mu_lt = data['mu_lt'].iloc[0]
    sigma_lt = data['sigma_lt'].iloc[0]
    df = pd.DataFrame(columns=('Material', 'Alpha', 'Beta', 'Sales', 'SL'))
    for beta in range(1, 2):
        for alpha in range(3, 5):
            # Do stuff
    return df

if __name__ == '__main__':
    spark = SparkSession.builder.getOrCreate()
    params = pd.read_csv('/dbfs/mnt/input/params_input.csv')
    params_spark = spark.createDataFrame(params)
    params_spark.groupby('Material').apply(main).show()
I'm not sure if I'm passing the DF to the main function correctly, or even declaring it right, but none of the prints nor the DF defined in the main function seem to run. The code throws no error, but no output is returned either.
Try this:
@pandas_udf('y int, ds int, store_id string, product_id string, log string', PandasUDFType.GROUPED_MAP)
def train_predict(pdf):
    return pd.DataFrame([[3, 5, 'store123', 'product123', 'My log message']],
                        columns=['y', 'ds', 'store_id', 'product_id', 'log'])
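If I understand the suggestion, the point is that print statements inside a GROUPED_MAP pandas_udf run on the executors, so their output goes to the worker logs rather than to the notebook; returning whatever you want to inspect as an extra column (log here) is one way to get it back to the driver. A possible usage sketch with the question's grouping column:

result = params_spark.groupby('Material').apply(train_predict)
result.select('log').show(truncate=False)   # the log column surfaces the per-group message on the driver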
Hi, I am following Frank Kane's course on Apache Spark with Python. Here I am trying to calculate the total amount spent by each customer. I have included the error below. Kindly help. The following is my code:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("MaxTemperatures")
sc = SparkContext(conf=conf)

def parseline(lines):
    fields = lines.split(',')
    customerId = int(fields[0])
    dollars = float(fields[2])
    return (customerId, dollars)

text = sc.textFile("file:///Sparkcourse/SparkCourse/customer-orders.csv")
rdd = text.map(parseline)
reduction = rdd.map(lambda x: x[1]).reduceByKey(lambda x, y: x + y)
sortedvalues = reduction.sortByKey()
final = sortedvalues.collect()
for i, j in final:
    print(i, j)
TypeError: cannot unpack non-iterable float object
I am not sure exactly what you want to do, and I have the intuition that there is at least one error in your code: reduceByKey works on an RDD of (key, value) pairs, and you should use it the following way:
reduceByKey(lambda x, y: x + y)
I think that in your case every step you want can be expressed with the DataFrame API, which will be easier to use. RDDs are not the easiest structure to handle for simple operations such as sums, etc. (and DataFrames will be faster).
So I can propose something like the following. You will probably have to change the schema statement to match your CSV structure. Assuming your Spark session is named spark:
import pyspark.sql.types as pst
import pyspark.sql.functions as psf
schema = pst.StructType([
    pst.StructField("customerId", pst.IntegerType(), True),
    pst.StructField("dollars", pst.IntegerType(), True),
    pst.StructField("productid", pst.IntegerType(), True)])

(spark.read
    .csv("file:///Sparkcourse/SparkCourse/customer-orders.csv", header=False, schema=schema)
    .groupBy('customerId')
    .agg(psf.sum("dollars").alias("dollars"))
    .orderBy('dollars')
)
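If you prefer to stay with RDDs for the exercise, the key point is that reduceByKey needs an RDD of (key, value) pairs, so the key must not be mapped away before reducing. A rough sketch along those lines (same file path as in the question; the app name is just a placeholder):

from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf().setMaster("local").setAppName("TotalSpent"))

def parseline(line):
    fields = line.split(',')
    return (int(fields[0]), float(fields[2]))                 # (customerId, dollars) pair

pairs = sc.textFile("file:///Sparkcourse/SparkCourse/customer-orders.csv").map(parseline)
totals = pairs.reduceByKey(lambda x, y: x + y)                # sum dollars per customer
by_amount = totals.map(lambda kv: (kv[1], kv[0])).sortByKey() # sort by total spent
for amount, customer in by_amount.collect():
    print(customer, amount)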
I have two timestamp columns in my pyspark dataframe. I want to create a third column that holds the array of hourly timestamps between the two.
This is the code I wrote for that:
# Creating udf function
from datetime import timedelta
import pyspark.sql.functions as F
from pyspark.sql.types import ArrayType, DateType

def getBetweenStamps(st_date, dc_date):
    import numpy as np
    hr = 0
    date_list = []
    runnig_date = st_date
    while (dc_date > runnig_date):
        runnig_date = st_date + timedelta(hours=hr)
        date_list.append(runnig_date)
        hr += 1
    dates = np.array(date_list)
    return dates

udf_betweens = F.udf(getBetweenStamps, ArrayType(DateType()))

# Using udf function
orders.withColumn('date_array', udf_betweens(F.col('start_date'), F.col('ICUDischargeDate'))).show()
However, this shows the error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I think the inputs to the function are going in as two arrays rather than two datetimes, causing the error. Is there any way around this, or any other way of solving the problem?
Thank you very much.
You are getting the error because you return a numpy array from your udf. You can simply return date_list instead and it will work.
def getBetweenStamps(st_date, dc_date):
    hr = 0
    date_list = []
    runnig_date = st_date
    while (dc_date > runnig_date):
        runnig_date = st_date + timedelta(hours=hr)
        date_list.append(runnig_date)
        hr += 1
    return date_list

udf_betweens = F.udf(getBetweenStamps, ArrayType(DateType()))
To test the above function:
from pyspark.sql.functions import col, expr

df = spark.sql("select current_timestamp() as t1").withColumn("t2", col("t1") + expr("INTERVAL 1 DAYS"))
df.withColumn('date_array', udf_betweens(F.col('t1'), F.col('t2'))).show()
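As a side note, on Spark 2.4+ the same array can usually be built without a Python udf at all by using the built-in sequence SQL function, which tends to be faster. A rough sketch, assuming start_date and ICUDischargeDate are proper timestamp columns on orders:

orders.withColumn(
    'date_array',
    F.expr("sequence(start_date, ICUDischargeDate, interval 1 hour)")  # hourly timestamps between the two columns
).show()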
So far, Spark doesn't create a DataFrame for streaming data, but for my anomaly detection it is more convenient and faster to use DataFrames for data analysis. I have done that part, but when I try to do real-time anomaly detection on streaming data, the problems appear. I have tried several ways and still could not convert the DStream into a DataFrame, nor convert the RDDs inside the DStream into DataFrames either.
Here's part of my latest version of the code:
import sys
import re
from pyspark import SparkContext
from pyspark.sql.context import SQLContext
from pyspark.sql import Row
from pyspark.streaming import StreamingContext
from pyspark.mllib.clustering import KMeans, KMeansModel, StreamingKMeans
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.functions import udf
import operator

sc = SparkContext(appName="test")
ssc = StreamingContext(sc, 5)
sqlContext = SQLContext(sc)
model_inputs = sys.argv[1]

def streamrdd_to_df(srdd):
    sdf = sqlContext.createDataFrame(srdd)
    sdf.show(n=2, truncate=False)
    return sdf

def main():
    indata = ssc.socketTextStream(sys.argv[2], int(sys.argv[3]))
    inrdd = indata.map(lambda r: get_tuple(r))
    Features = Row('rawFeatures')
    features_rdd = inrdd.map(lambda r: Features(r))
    features_rdd.pprint(num=3)
    streaming_df = features_rdd.flatMap(streamrdd_to_df)
    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    main()
As you can see in the main() function, when I read the input streaming data using the ssc.socketTextStream() method, it generates a DStream. I then tried to convert each record in the DStream into a Row, hoping I could convert the data into a DataFrame later.
If I use pprint() to print out features_rdd here, it works, which makes me think each element of features_rdd is a batch RDD while the whole features_rdd is a DStream.
Then I created the streamrdd_to_df() method, hoping to convert each batch RDD into a DataFrame, but it gives me this error:
ERROR StreamingContext: Error starting the context, marking it as stopped
java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
Any thoughts on how I can do DataFrame operations on Spark streaming data?
Spark has provided us with Structured Streaming, which can solve such problems. It generates a streaming DataFrame, i.e. a DataFrame that is continuously appended to. Please check the link below:
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
Read the error carefully. It says there are no output operations registered. Spark is lazy and executes the job/code only when it has something to produce as a result. In your program there is no "output operation", and that is exactly what Spark is complaining about.
Define a foreach() or a raw SQL query over the DataFrame and then print the results, as sketched below. It will work fine.
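In the question's own terms that could look something like the following (the empty-RDD guard is an assumption, since socket micro-batches can be empty):

def show_batch(rdd):
    # skip empty micro-batches so createDataFrame doesn't fail on an empty RDD
    if not rdd.isEmpty():
        sqlContext.createDataFrame(rdd).show(n=2, truncate=False)

# register the output operation before ssc.start()
features_rdd.foreachRDD(show_batch)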
Why don't you use something like this:
def socket_streamer(session):  # returns a streaming dataframe
    streamer = session.readStream \
        .format("socket") \
        .option("host", "localhost") \
        .option("port", 9999) \
        .load()
    return streamer
The output of this function (or of readStream in general) is already a DataFrame. You don't need to worry about the df there; it is created automatically by Spark.
See the Spark Structured Streaming Programming Guide
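To actually see any data, the streaming DataFrame still needs a sink started through writeStream. A minimal usage sketch, with a console sink assumed purely for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
query = (socket_streamer(spark)
         .writeStream
         .format("console")      # print each micro-batch to stdout
         .outputMode("append")
         .start())
query.awaitTermination()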
After a year, I started to explore the Spark 2.0 streaming methods and finally solved my anomaly detection problem. Here's my code in IPython, where you can also see what my raw data input looks like.
There is no need to convert the DStream into an RDD. By definition, a DStream is a collection of RDDs. Just use the DStream's foreachRDD() method to loop over each RDD and take action.
val conf = new SparkConf()
  .setAppName("Sample")
val spark = SparkSession.builder.config(conf).getOrCreate()

sampleStream.foreachRDD(rdd => {
  val sampleDataFrame = spark.read.json(rdd)
})
The Spark documentation has an introduction to working with DStreams. Basically, you have to use foreachRDD on your stream object to interact with it.
Here is an example (make sure you create a Spark session object):
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext

def process_stream(record, spark):
    if not record.isEmpty():
        df = spark.createDataFrame(record)
        df.show()

def main():
    sc = SparkContext(appName="PysparkStreaming")
    spark = SparkSession(sc)
    ssc = StreamingContext(sc, 5)

    dstream = ssc.textFileStream(folder_path)
    transformed_dstream = # transformations
    transformed_dstream.foreachRDD(lambda rdd: process_stream(rdd, spark))
    #                   ^^^^^^^^^^
    ssc.start()
    ssc.awaitTermination()
With Spark 2.3 / Python 3 / Scala 2.11 (using Databricks), I was able to use temporary tables and a code snippet in Scala (using the %scala cell magic in notebooks):
Python Part:
ddf.createOrReplaceTempView("TempItems")
Then on a new cell:
%scala
import java.sql.DriverManager
import org.apache.spark.sql.ForeachWriter
import org.apache.spark.sql.Row

// Create the query to be persisted...
val tempItemsDF = spark.sql("SELECT field1, field2, field3 FROM TempItems")

val itemsQuery = tempItemsDF.writeStream.foreach(new ForeachWriter[Row]
{
  def open(partitionId: Long, version: Long): Boolean = {
    // Initializing DB connection / etc...
    true // open must return a Boolean; true means "process this partition"
  }

  def process(value: Row): Unit = {
    val field1 = value(0)
    val field2 = value(1)
    val field3 = value(2)

    // Processing values ...
  }

  def close(errorOrNull: Throwable): Unit = {
    // Closing connections etc...
  }
})

val streamingQuery = itemsQuery.start()