Runtime error using MLflow and Spark on Databricks - python

Here is some model I created:
class SomeModel(mlflow.pyfunc.PythonModel):
    def predict(self, context, input):
        # do fancy ML stuff
        # log results
        pandas_df = pd.DataFrame(...insert predictions here...)
        spark_df = spark.createDataFrame(pandas_df)
        spark_df.write.saveAsTable('tablename', mode='append')
I'm trying to log my model in this manner by calling it later in my code:
with mlflow.start_run(run_name="SomeModel_run"):
    model = SomeModel()
    mlflow.pyfunc.log_model("somemodel", python_model=model)
Unfortunately it gives me this Error Message:
RuntimeError: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
The error is caused by the line mlflow.pyfunc.log_model("somemodel", python_model=model); if I comment it out, my model makes its predictions and logs the results to my table.
Alternatively, if I remove the lines in my predict function where I call Spark to create a DataFrame and save the table, I am able to log my model.
How do I go about resolving this issue? I need my model not only to write to the table but also to be logged.

I had a similar error and was able to resolve it by moving Spark calls like spark.createDataFrame(pandas_df) outside of the class. If you want to read or write data with Spark, do it in the main function.
class SomeModel(mlflow.pyfunc.PythonModel):
    def predict(self, context, input):
        # do fancy ML stuff
        # log results
        return predictions

with mlflow.start_run(run_name="SomeModel_run"):
    model = SomeModel()
    pandas_df = pd.DataFrame(...insert predictions here...)
    mlflow.pyfunc.log_model("somemodel", python_model=model)
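Keeping the Spark calls on the driver, the run can then both log the model and persist the predictions. A minimal sketch, assuming a Databricks notebook where spark and pd are already available and input_df is a placeholder pandas DataFrame of features:
with mlflow.start_run(run_name="SomeModel_run"):
    model = SomeModel()
    # safe to log now: the class no longer captures the SparkContext
    mlflow.pyfunc.log_model("somemodel", python_model=model)

    # run the prediction on the driver and write the results with Spark
    predictions = model.predict(None, input_df)    # input_df: placeholder pandas DataFrame
    spark_df = spark.createDataFrame(pd.DataFrame(predictions))
    spark_df.write.saveAsTable('tablename', mode='append')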

Related

SwishImplementation error when saving jit trace

I am trying to jit trace and save my pytorch model from the segmentation models package. But I am getting an error: "Could not export Python function call 'SwishImplementation'. Remove calls to Python functions before export. Did you forget to add @script or @script_method annotation? If this is a nn.ModuleList, add it to __constants__." It only happens when I use the efficientnet backbone. How can I get the save() function to work? I need to be able to use the model in a C++ application.
import torch
import segmentation_models_pytorch as smp
model = smp.Unet('efficientnet-b7')
model.eval()
input = torch.randn((1,3,224,224))
torch_out = model(input)
model = torch.jit.trace(model,input)
trace_out = model(input)
model.save('model.pt')
The Unet model from the segmentation_models_pytorch module uses an EfficientNet encoder, which in turn uses a MemoryEfficientSwish module. To fix the error, change all instances of MemoryEfficientSwish to Swish before saving the model.
You can iterate over the modules of the Unet model and, if a module is an instance of EfficientNet, call .set_swish(memory_efficient=False) on it.
After that, you can load the state_dict, and then trace and save the model.
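A minimal sketch of that fix, assuming the encoder is the EfficientNet implementation from the efficientnet_pytorch package (which is what segmentation_models_pytorch wraps and which provides set_swish):
import torch
import segmentation_models_pytorch as smp
from efficientnet_pytorch import EfficientNet

model = smp.Unet('efficientnet-b7')
model.eval()

# replace MemoryEfficientSwish (a custom autograd function) with the traceable Swish
for module in model.modules():
    if isinstance(module, EfficientNet):
        module.set_swish(memory_efficient=False)

example = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(model, example)
traced.save('model.pt')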

Use mlflow to serve a custom python model for scoring

I am using Python code generated from an ML tool together with mlflow to read a data frame, perform some table operations, and output a data frame. I am able to run the code successfully and save the new data frame as an artifact. However, I am unable to log the model using log_model because it is not an LR or classifier model that we train and fit. I want to log a model for this so that it can be served with new data and deployed via a REST API.
df = pd.read_csv(r"/home/xxxx.csv")

with mlflow.start_run():
    def getPrediction(row):
        perform_some_python_operations
        return [Status_prediction, Status_0_probability, Status_1_probability]

    columnValues = []
    for column in columns:
        columnValues.append([])
    for index, row in df.iterrows():
        results = getPrediction(row)
        for n in range(len(results)):
            columnValues[n].append(results[n])
    for n in range(len(columns)):
        df[columns[n]] = columnValues[n]

    df.to_csv('dataset_statistics.csv')
    mlflow.log_artifact('dataset_statistics.csv')
MLflow supports custom models of the mlflow.pyfunc flavor. You can create a custom class that inherits from mlflow.pyfunc.PythonModel; it needs to provide a predict function for performing predictions and an optional load_context to load the necessary artifacts, like this (adapted from the docs):
class MyModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        ...  # load your artifacts

    def predict(self, context, model_input):
        return my_predict(model_input.values)
You can log to MLflow whatever artifacts you need for your model, define a Conda environment if necessary, etc.
Then you can use save_model with your class to save your implementation; it can later be loaded with load_model and used to predict with your model:
mlflow.pyfunc.save_model(
    path=mlflow_pyfunc_model_path,
    python_model=MyModel(),
    artifacts=artifacts)

# Load the model in `python_function` format
loaded_model = mlflow.pyfunc.load_model(mlflow_pyfunc_model_path)
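Applied to the question's row-wise logic, a wrapper might look roughly like this. getPrediction and the Status_* column names come from the question itself; the sketch assumes getPrediction is defined at module level so it can be pickled along with the model:
import mlflow.pyfunc

class StatusModel(mlflow.pyfunc.PythonModel):
    def predict(self, context, model_input):
        # model_input is a pandas DataFrame; apply the question's row-wise function
        results = model_input.apply(lambda row: getPrediction(row),
                                    axis=1, result_type="expand")
        results.columns = ["Status_prediction",
                           "Status_0_probability",
                           "Status_1_probability"]
        return results

with mlflow.start_run():
    mlflow.pyfunc.log_model("status_model", python_model=StatusModel())
The logged model can then be served, for example with mlflow models serve, and queried over REST with new data.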

How to do inference in parallel with tensorflow saved model predictors?

Tensorflow version: 1.14
Our current setup uses TensorFlow estimators to do live NER, i.e. perform inference one document at a time. We have 30 different fields to extract, and we run one model per field, for a total of 30 models.
Our current setup uses Python multiprocessing to do the inferences in parallel. (The inference is done on CPUs.) This approach reloads the model weights each time a prediction is made.
Using the approach mentioned here, we exported the estimator models as tf.saved_model. This works as expected in that it does not reload the weights for each request. It also works fine for single-field inference in one process, but it doesn't work with multiprocessing: all the processes hang when we call the predict function (predict_fn in the linked post).
This post is related, but I'm not sure how to adapt it for the saved model.
Importing TensorFlow individually in each of the predictors did not work either:
class SavedModelPredictor():
    def __init__(self, model_path):
        import tensorflow as tf
        self.predictor_fn = tf.contrib.predictor.from_saved_model(model_path)

    def predictor_fn(self, input_dict):
        return self.predictor_fn(input_dict)
How to make tf.saved_model work with multiprocessing?
Ray Serve, Ray's model serving solution, also supports offline batching. You can wrap your model in a Ray Serve backend and scale it to the number of replicas you want.
import ray
from ray import serve

client = serve.start()

class MyTFModel:
    def __init__(self, model_path):
        self.model = ...  # load model

    @serve.accept_batch
    def __call__(self, input_batch):
        assert isinstance(input_batch, list)

        # forward pass
        self.model([item.data for item in input_batch])

        # return a list of responses
        return [...]

client.create_backend("tf", MyTFModel,
    # configure resources
    ray_actor_options={"num_cpus": 2, "num_gpus": 1},
    # configure replicas
    config={
        "num_replicas": 2,
        "max_batch_size": 24,
        "batch_wait_timeout": 0.5
    }
)
client.create_endpoint("tf", backend="tf")
handle = serve.get_handle("tf")

# perform inference on a list of inputs
futures = [handle.remote(data) for data in fields]
result = ray.get(futures)
Try it out with the nightly wheel; here's the tutorial: https://docs.ray.io/en/master/serve/tutorials/batch.html
Edit: updated the code sample for Ray 1.0
OK, so the approach outlined in this answer with Ray worked.
I built a class like this, which loads the model on init and exposes a run function to perform the prediction:
import tensorflow as tf
import ray

ray.init()

@ray.remote
class MyModel(object):
    def __init__(self, field, saved_model_path):
        self.field = field
        # load the model once in the constructor
        self.predictor_fn = tf.contrib.predictor.from_saved_model(saved_model_path)

    def run(self, df_feature, *args):
        # ...
        # code to perform prediction using self.predictor_fn
        # ...
        return self.field, list_pred_string, list_pred_proba
Then used the above in the main module as:
# form a dictionary with key 'field' and value MyModel
model_dict = {}
for field in fields:
    export_dir = f"saved_model/{field}"
    subdirs = [x for x in Path(export_dir).iterdir()
               if x.is_dir() and 'temp' not in str(x)]
    latest = str(sorted(subdirs)[-1])
    model_dict[field] = MyModel.remote(field, latest)
Then used the above model dictionary to do predictions like this:
results = ray.get([model_dict[field].run.remote(df_feature) for field in fields])
Update:
While this approach works, we found that running estimators in parallel with multiprocessing is faster than running predictors in parallel with Ray. This is especially true for large documents. The predictor approach seems to work well for a small number of dimensions and when the input data is not large. Maybe an approach like the one mentioned here would be better for our use case.

How to run a function on all Spark workers before processing data in PySpark?

I'm running a Spark Streaming task in a cluster using YARN. Each node in the cluster runs multiple Spark workers. Before the streaming starts I want to execute a "setup" function on all workers on all nodes in the cluster.
The streaming task classifies incoming messages as spam or not spam, but before it can do that it needs to download the latest pre-trained models from HDFS to the local disk, like in this pseudo-code example:
def fetch_models():
    if hadoop.version > local.version:
        hadoop.download()
I've seen the following examples here on SO:
sc.parallelize().map(fetch_models)
But in Spark 1.6 parallelize() requires some data to be passed, like this ugly work-around I'm using now:
sc.parallelize(range(1, 1000)).map(fetch_models)
Just to be fairly sure that the function runs on ALL workers, I set the range to 1000. I also don't know exactly how many workers are in the cluster when the job runs.
I've read the programming documentation and googled relentlessly but I can't seem to find any way to actually just distribute anything to all workers without any data.
After this initialization phase is done, the streaming task operates as usual on incoming data from Kafka.
The way I'm using the models is by running a function similar to this:
spark_partitions = config.get(ConfigKeys.SPARK_PARTITIONS)
stream.union(*create_kafka_streams())\
    .repartition(spark_partitions)\
    .foreachRDD(lambda rdd: rdd.foreachPartition(
        lambda partition: spam.on_partition(config, partition)))
Theoretically I could check whether or not the models are up to date in the on_partition function, though it would be really wasteful to do this on each batch. I'd like to do it before Spark starts retrieving batches from Kafka, since the downloading from HDFS can take a couple of minutes...
UPDATE:
To be clear: it's not a question of how to distribute the files or how to load them; it's about how to run an arbitrary method on all workers without operating on any data.
To clarify what loading the models currently means:
def on_partition(config, partition):
    if not MyClassifier.is_loaded():
        MyClassifier.load_models(config)
    handle_partition(config, partition)
While MyClassifier is something like this:
class MyClassifier:
    clf = None

    @staticmethod
    def is_loaded():
        return MyClassifier.clf is not None

    @staticmethod
    def load_models(config):
        MyClassifier.clf = load_from_file(config)
I use static methods since PySpark doesn't seem to be able to serialize classes with non-static methods (the state of the class is irrelevant to other workers). Here we only have to call load_models() once, and on all future batches MyClassifier.clf will be set. This is something that really should not be done for each batch; it's a one-time thing. Same with downloading the files from HDFS using fetch_models().
If all you want is to distribute a file between worker machines, the simplest approach is the SparkFiles mechanism:
some_path = ... # local file, a file in DFS, an HTTP, HTTPS or FTP URI.
sc.addFile(some_path)
and retrieve it on the workers using SparkFiles.get and standard IO tools:
from pyspark import SparkFiles

with open(SparkFiles.get(some_path)) as fw:
    ...  # Do something
If you want to make sure that the model is actually loaded, the simplest approach is to load it on module import. Assuming config can be used to retrieve the model path:
model.py:
from pyspark import SparkFiles

config = ...

class MyClassifier:
    clf = None

    @staticmethod
    def is_loaded():
        return MyClassifier.clf is not None

    @staticmethod
    def load_models(config):
        path = SparkFiles.get(config.get("model_file"))
        MyClassifier.clf = load_from_file(path)

# Executed once per interpreter
MyClassifier.load_models(config)
main.py:
from pyspark import SparkContext
config = ...
sc = SparkContext("local", "foo")
# Executed before StreamingContext starts
sc.addFile(config.get("model_file"))
sc.addPyFile("model.py")
import model
ssc = ...
stream = ...
stream.map(model.MyClassifier.do_something).pprint()
ssc.start()
ssc.awaitTermination()
This is a typical use case for Spark's broadcast variables. Assuming fetch_models returns the models rather than saving them locally, you would do something like this:
bc_models = sc.broadcast(fetch_models())

spark_partitions = config.get(ConfigKeys.SPARK_PARTITIONS)
stream.union(*create_kafka_streams())\
    .repartition(spark_partitions)\
    .foreachRDD(lambda rdd: rdd.foreachPartition(
        lambda partition: spam.on_partition(config, partition, bc_models.value)))
This does assume that your models fit in memory on the driver and the executors.
You may be worried that broadcasting the models from the single driver to all the executors is inefficient, but broadcast uses 'efficient broadcast algorithms' that can significantly outperform distribution through HDFS, according to this analysis.
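on_partition then receives the already-loaded models through the broadcast value instead of loading anything from disk. A rough sketch based on the question's pseudo code, where extract_features and handle_result are hypothetical helpers:
def on_partition(config, partition, models):
    # `models` is bc_models.value, deserialized once per executor
    for message in partition:
        features = extract_features(message)   # hypothetical feature extraction
        label = models.predict(features)       # assumes fetch_models() returned a fitted classifier
        handle_result(config, message, label)  # hypothetical sink for the prediction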

Combining Spark Streaming + MLlib

I've tried to use a Random Forest model to predict on a stream of examples, but it appears that I cannot use that model to classify the examples.
Here is the code used in pyspark:
sc = SparkContext(appName="App")

model = RandomForest.trainClassifier(trainingData, numClasses=2,
                                     categoricalFeaturesInfo={},
                                     impurity='gini', numTrees=150)

ssc = StreamingContext(sc, 1)
lines = ssc.socketTextStream(hostname, int(port))

parsedLines = lines.map(parse)
parsedLines.pprint()

predictions = parsedLines.map(lambda event: model.predict(event.features))
and the error returned while running it on the cluster:
Error : "It appears that you are attempting to reference SparkContext from a broadcast "
Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
Is there a way to use a model generated from static data to predict on streaming examples?
Thanks guys, I really appreciate it!
Yes, you can use a model generated from static data. The problem you're experiencing is not related to streaming at all. You simply cannot use a JVM-based model inside an action or transformation (see How to use Java/Scala function from an action or a transformation? for an explanation why). Instead, you should apply the predict method to a complete RDD, for example using transform on the DStream:
from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from operator import attrgetter

sc = SparkContext("local[2]", "foo")
ssc = StreamingContext(sc, 1)

data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
trainingData, testData = data.randomSplit([0.7, 0.3])

model = RandomForest.trainClassifier(
    trainingData, numClasses=2, categoricalFeaturesInfo={}, numTrees=3
)

(ssc
    .queueStream([testData])
    # Extract features
    .map(attrgetter("features"))
    # Predict
    .transform(lambda _, rdd: model.predict(rdd))
    .pprint())

ssc.start()
ssc.awaitTerminationOrTimeout(10)
