How to serve a Prophet model for my Django application? - python

I created a model with Facebook Prophet. I now wonder what the "best" way is to access these predictions from an online web application (Django).
Requirements are that I have to train/update my model on a weekly basis with data from my Django application (PostgreSQL). The predictions will be saved, and I want to be able to call/access this data from my Django application.
After I looked into Google Cloud and AWS I couldn't find any solution that hosts my model in a way that I can just access predictions via an API.
My best idea/approach to solve that right now:
1) Build a Flask application that trains my models on a weekly basis. Predictions are saved in a PostgreSQL database. The data will be a weekly CSV export from my Django web application.
2) Create an API in my Flask application that can access the predictions from the database (a minimal sketch of this follows the list).
3) From my Django application, I can call the API and access the data whenever needed.
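As a sketch of step 2 above (not the asker's actual code): a minimal Flask endpoint that reads pre-computed forecasts from PostgreSQL, assuming a hypothetical predictions table keyed by item_id and psycopg2 installed.
import psycopg2
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/predictions/<item_id>")
def predictions(item_id):
    # hypothetical table: predictions(item_id, ds, yhat, yhat_lower, yhat_upper)
    conn = psycopg2.connect(dbname="forecasts", user="app", host="db")
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT ds, yhat, yhat_lower, yhat_upper FROM predictions WHERE item_id = %s",
            (item_id,),
        )
        rows = cur.fetchall()
    conn.close()
    return jsonify({"predictions": [
        {"ds": ds.isoformat(), "yhat": y, "yhat_lower": lo, "yhat_upper": hi}
        for ds, y, lo, hi in rows
    ]})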
I am pretty sure this approach is clunky and probably not how it is usually done. Do you have any feedback or ideas on how to solve it better? In short:
1) Predict data from a PostgreSQL database.
2) Serve predictions in a Django web application.

The simplest way to serve pre-calculated forecast values from Prophet is to serve CSV files from S3 or another file store. You can refresh your models every few days and write the forecast output to S3:
import boto3
from io import StringIO

DESTINATION = bucket_name

def write_dataframe_to_csv_on_s3(dataframe, filename):
    """ Write a dataframe to a CSV on S3 """
    print("Writing {} records to {}".format(len(dataframe), filename))
    # Create buffer
    csv_buffer = StringIO()
    # Write dataframe to buffer
    dataframe.to_csv(csv_buffer, sep=",", index=False)
    # Create S3 object
    s3_resource = boto3.resource("s3")
    # Write buffer to S3 object
    s3_resource.Object(DESTINATION, filename).put(Body=csv_buffer.getvalue())

results = forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].copy()
write_dataframe_to_csv_on_s3(results, output + file_name + ".csv")
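On the Django side, reading the pre-computed forecast back could look roughly like this (a sketch assuming the same bucket and key as above, with boto3 and pandas available):
import boto3
import pandas as pd
from io import StringIO

def read_forecast_from_s3(bucket, key):
    """Load a pre-computed forecast CSV from S3 into a DataFrame."""
    s3_client = boto3.client("s3")
    obj = s3_client.get_object(Bucket=bucket, Key=key)
    return pd.read_csv(StringIO(obj["Body"].read().decode("utf-8")))

forecast_df = read_forecast_from_s3(DESTINATION, output + file_name + ".csv")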

One of the reasons I visited this question was that I wasn't sure which way to go. The answer above seems like a great alternative.
However, I didn't have many constraints on my Django application, and I was looking for a simpler way for someone with a use case similar to mine.
My solution:
I have a Django project, a Django app that serves my website, and a Django app for the Prophet model.
The Prophet model is re-trained exactly once each day (when some condition is met).
Each day the model is trained, it predicts on the new data and saves the predictions to a CSV file (which could also be stored in a database). It also stores the trained model using pickle.
Now, I have access to the trained model and some pre-defined predictions by importing the Django app wherever I need it.
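A minimal sketch of that daily job (the function, file paths and horizon are placeholders; df is assumed to be a DataFrame with Prophet's ds/y columns pulled from the database):
import pickle
from prophet import Prophet  # fbprophet in older versions

def retrain_and_predict(df, horizon_days=30):
    model = Prophet()
    model.fit(df)
    future = model.make_future_dataframe(periods=horizon_days)
    forecast = model.predict(future)
    # persist predictions so the website app can read them without re-running Prophet
    forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].to_csv("predictions.csv", index=False)
    # persist the fitted model for later reuse
    with open("prophet_model.pkl", "wb") as f:
        pickle.dump(model, f)
    return forecast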
The project hierarchy:
project/
    project/
    django-app-for-website/
    django-app-for-prophet/
    manage.py
    requirements.txt
The performance of my project isn't affected much, and performance isn't my priority for now; if it is yours, I wouldn't recommend this solution.
If you're looking for the simplest way to serve a Prophet model, this is what I could come up with. Just another possible solution.

Related

How to save model with cloudpickle to databricks DBFS folder and load it?

I built a model and my goal is to save the model as a pickle and load it later for scoring. Right now, I am using this code:
#save model as pickle
import cloudpickle
pickled = cloudpickle.dumps(final_model)
#load model
cloudpickle.loads(pickled)
Output: <econml.dml.causal_forest.CausalForestDML at 0x7f388e70c373>
My worry is that with this approach the model is only saved in a session variable ("pickled") of the Databricks notebook.
I want the model to be stored in DBFS storage instead, so I can pull the model at any time (even if my notebook session expires) to make this more robust.
How would I do this?
You need to use /dbfs/<DBFS path> to save the model using the local file API (note that this isn't supported on Community Edition).
But really, I would recommend relying on MLflow to log the model and all necessary hyperparameters into the MLflow Model Registry; then you can easily load the model by name instead of by path. With MLflow you also get more benefits, like tracking multiple model versions, using stages, etc.
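A minimal sketch of the DBFS part, assuming a regular (non-Community) cluster where /dbfs/ exposes DBFS through the local file API; the path is a placeholder:
import cloudpickle

MODEL_PATH = "/dbfs/models/causal_forest.pkl"  # hypothetical DBFS location

# persist the model so it survives the notebook session
with open(MODEL_PATH, "wb") as f:
    cloudpickle.dump(final_model, f)

# later, possibly in a new session, load it back for scoring
with open(MODEL_PATH, "rb") as f:
    final_model = cloudpickle.load(f)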

Storing images through a Flask app. Images in Postgres or S3 filepaths in Postgres table? [duplicate]

This question already has answers here:
Which SQLAlchemy column type should be used for binary data?
(2 answers)
Closed 5 years ago.
I'm working on a Flask backend app. There are profile images coming in from the Android frontend to the Flask API endpoint, and I want to store these images.
Tech stack : Android app, Flask API/backend, Postgres, AWS services.
What would be the best idea?
I thought of the following ideas. Do let me know if any of these ideas make sense!
Storing the images directly in the Postgres database (I think this is bad, as it will put a load on the database).
Storing the images in Amazon S3 buckets and the S3 file paths as values in the Postgres table.
An improvement on 2, I thought, would be to keep that arrangement and add a CDN such as Amazon CloudFront for faster distribution to other services requesting the images.
How does it sound? Any other ideas?
Thanks! :)
Option 2 is a commonly used approach, but depending on how it is used, the lo type in Postgres for binary large objects could also be OK. Primarily, this is about the sort of throughput you need for writing the images, and what sort of request interface is used for other applications that retrieve the images (e.g. serializing and deserializing blob images through a Python web framework would not be a good idea, but for certain use cases, doing something like a gRPC server might be fine this way).
For option 2, it's a good idea to think about how you can create some type of image ID for the images, that will be robust to changes in file paths or names, format changes, storage bucket changes, etc. It can be a pain, but getting something set up where all these pesky aspects about image specifiers are properly normalized is worth it.
It's also worth it to think about whether you need a range of sizes or post-processing outputs, like storing 'thumbnail', 'original size' and 'small' versions of the images. It can be a huge pain to backfill this stuff if it's not laid out in a sane way up front.
So much of this depends on your actual use case. How will the data set grow? What sort of latency is required when people request image data? Will you ever possibly store other types of assets besides images? What sort of throughput demands are there, both for ingestion and for serving the images later?
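A rough sketch of option 2 with a stable image ID (Flask, Flask-SQLAlchemy and boto3 assumed; the model, bucket name and key layout are hypothetical):
import uuid

import boto3
from flask import Flask, request, jsonify
from flask_sqlalchemy import SQLAlchemy

app = Flask(__name__)
app.config["SQLALCHEMY_DATABASE_URI"] = "postgresql://user:password@localhost/appdb"
db = SQLAlchemy(app)
s3 = boto3.client("s3")
BUCKET = "profile-images"  # hypothetical bucket name

class ProfileImage(db.Model):
    id = db.Column(db.String(36), primary_key=True)  # stable image ID
    s3_key = db.Column(db.Text, nullable=False)      # where the bytes live

@app.route("/profile-image", methods=["POST"])
def upload_profile_image():
    image_id = str(uuid.uuid4())
    key = f"original/{image_id}.jpg"
    s3.upload_fileobj(request.files["image"], BUCKET, key)
    db.session.add(ProfileImage(id=image_id, s3_key=key))
    db.session.commit()
    return jsonify({"image_id": image_id})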

How to develop a REST API using an ML model trained on Apache Spark?

Assume this scenario:
We analyze the data, train some machine learning models using whatever tools we have at hand, and save those models. This is done in Python, using the Apache Spark Python shell and API. We know Apache Spark is good at batch processing, hence a good choice for the above scenario.
Now going into production, for each given request, we need to return a response which depends also on the output of the trained model. This is, I assume, what people call stream processing, and Apache Flink is usually recommended for it. But how would you use the same models trained using tools available in Python, in a Flink pipeline?
The micro-batch mode of Spark wouldn't work here, since we really need to respond to each request, and not in batches.
I've also seen some libraries trying to do machine learning in Flink, but that doesn't satisfy the needs of people who have diverse tools in Python rather than Scala, and who are not even familiar with Scala.
So the question is, how do people approach this problem?
This question is related, but not a duplicate, since the author there explicitly mentions using Spark MLlib. That library runs on the JVM and has more potential to be ported to other JVM-based platforms. But here the question is how people would approach it if, let's say, they were using scikit-learn, GPy, or whatever other method/package.
I needed a way of creating a custom Transformer for an ML Pipeline and to have that custom object saved/loaded along with the rest of the pipeline. This led me to dig into the very ugly depths of Spark model serialisation/deserialisation. In short, it looks like all the Spark ML models have two components, metadata and model data, where model data is whatever parameters were learned during .fit(). The metadata is saved in a directory called metadata under the model save dir and, as far as I can tell, is JSON, so that shouldn't be an issue. The model parameters themselves seem to be saved simply as a parquet file in the save dir. This is the implementation for saving an LDA model:
override protected def saveImpl(path: String): Unit = {
  DefaultParamsWriter.saveMetadata(instance, path, sc)
  val oldModel = instance.oldLocalModel
  val data = Data(instance.vocabSize, oldModel.topicsMatrix, oldModel.docConcentration,
    oldModel.topicConcentration, oldModel.gammaShape)
  val dataPath = new Path(path, "data").toString
  sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath)
}
Notice the sqlContext.createDataFrame(Seq(data)).repartition(1).write.parquet(dataPath) on the last line. The good news is that you could load the file into your webserver, and if the server is in Java/Scala you'll just need to keep the Spark jars on the classpath.
If, however, you're using, say, Python for the webserver, you could use a parquet library for Python (https://github.com/jcrobak/parquet-python). The bad news is that some or all of the objects in the parquet file are going to be binary Java dumps, so you can't actually read them in Python. A few options come to mind: use Jython (meh), or use Py4J to load the objects; this is what PySpark uses to communicate with the JVM, so it could actually work. I wouldn't expect it to be exactly straightforward though.
Or, from the linked question, use jpmml-spark and hope for the best.
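As a quick way to poke at that saved-model layout from Python (using pandas/pyarrow here rather than parquet-python, and with a hypothetical save path), something like this shows the metadata JSON and the learned-parameter columns; per the caveat above, binary Java-serialised columns won't be readable:
import glob
import json
import pandas as pd

model_dir = "/models/lda"  # wherever model.save(...) wrote to (placeholder)

# metadata/ holds plain-text JSON (one object per line)
with open(glob.glob(f"{model_dir}/metadata/part-*")[0]) as f:
    metadata = json.loads(f.readline())
print(metadata.get("class"), metadata.get("paramMap"))

# data/ holds the learned parameters as parquet
params = pd.read_parquet(f"{model_dir}/data")
print(params.dtypes)  # some columns may be opaque binary values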
Have a look at MLeap.
We have had some success externalizing a model learned on Spark into separate services that provide predictions on new incoming data. We externalized an LDA topic modelling pipeline, albeit in Scala. But they do have Python support, so it's worth looking at.
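If I recall the MLeap PySpark bindings correctly, exporting a fitted pipeline looks roughly like the sketch below (names and paths are placeholders; treat this as orientation, not the definitive API):
import mleap.pyspark  # noqa: F401  (adds serializeToBundle to Spark models)
from mleap.pyspark.spark_support import SimpleSparkSerializer  # noqa: F401

# fitted_pipeline: a fitted pyspark.ml PipelineModel; train_df: the training DataFrame
fitted_pipeline.serializeToBundle(
    "jar:file:/tmp/model.zip",
    fitted_pipeline.transform(train_df),
)
# the resulting bundle can then be loaded by an MLeap runtime outside of Spark
# to score individual requests without a SparkContext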

Data for my app in App Engine (Python)

I'm trying to import some data into my app. I'm a n00b and this is my first app, so any advice is welcome.
The data I will upload is about 350 MB in a .csv file. I tried to upload it with the bulk upload, but I got this error:
Unable to download kind stats for all-kinds download.
Kind stats are generated periodically by the appserver.
Kind stats are not available on dev_appserver.
I searched a little and found that this is because the app doesn't have datastore stats. I created the model and populated it with some entities a few days ago, but the app still doesn't have the statistics. I need advice: what would be the best way to upload this amount of data to the app?
if is useful, the data is here:
http://dados.gov.br/dataset/atendimentos-de-consumidores-nos-procons-sindec
thanks
You must be using the automatic option to generate bulkloader.yaml. For the auto-generation, App Engine uses the datastore statistics as the reference. Initially you had no data and hence no statistics, so you got the error about missing kind stats.
When you later added data to your datastore, your datastore statistics may still have been empty, because it normally takes up to 24 hours, and sometimes even a few days, for App Engine to update them. So your kind stats are probably still empty and the auto-generation option will not work.
You can follow the bulk upload documentation and manually create a bulkloader.yaml based on the example given there, modifying it to match your datastore fields and their types. Using this manually written file, you should be able to proceed.
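For orientation, a hand-written bulkloader.yaml has roughly the shape below; the kind and field names are hypothetical placeholders (not taken from the linked dataset), so adapt them to your own model and consult the bulk upload docs for the full set of options:
# Hypothetical skeleton of a hand-written bulkloader.yaml
python_preamble:
- import: google.appengine.ext.bulkload.transform

transformers:
- kind: Atendimento              # your datastore kind (placeholder)
  connector: csv
  connector_options:
    encoding: utf-8
  property_map:
    - property: regiao           # datastore property (placeholder)
      external_name: Regiao      # CSV column header (placeholder)
      import_transform: transform.none_if_empty(unicode)
    - property: ano
      external_name: Ano
      import_transform: transform.none_if_empty(int)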

Importing a CSV file into a PostgreSQL DB using Python-Django

Note: Scroll down to the Background section for useful details. Assume the project uses Python/Django and South in the following illustration.
What's the best way to import the following CSV
"john","doe","savings","personal"
"john","doe","savings","business"
"john","doe","checking","personal"
"john","doe","checking","business"
"jemma","donut","checking","personal"
Into a PostgreSQL database with the related tables Person, Account, and AccountType considering:
Admin users can change the database model and CSV import-representation in real-time via a custom UI
The saved CSV-to-Database table/field mappings are used when regular users import CSV files
So far two approaches have been considered
ETL-API Approach: Provide an ETL API with a spreadsheet, my CSV-to-Database table/field mappings, and connection info for the target database. The API would then load the spreadsheet and populate the target database tables. Looking at pygrametl, I don't think what I'm aiming for is possible. In fact, I'm not sure any ETL APIs do this.
Row-level Insert Approach: Parse the CSV-to-Database table/field mappings, parse the spreadsheet, and generate SQL inserts in "join order".
I implemented the second approach but am struggling with algorithm defects and code complexity. Is there a Python ETL API out there that does what I want? Or an approach that doesn't involve reinventing the wheel?
Background
The company I work at is looking to move hundreds of project-specific design spreadsheets hosted in SharePoint into databases. We're close to completing a web application that meets the need by allowing an administrator to define/model a database for each project, store spreadsheets in it, and define the browse experience. At this stage of completion, transitioning to a commercial tool isn't an option. Think of the web application as a django-admin alternative (though it isn't), with a DB modeling UI, CSV import/export functionality, customizable browsing, and modularized code to address project-specific customizations.
The implemented CSV import interface is cumbersome and buggy, so I'm trying to get feedback and find alternative approaches.
How about separating the problem into two separate problems?
Create a Person class which represents a person in the database. This could use Django's ORM, or extend it, or you could do it yourself.
Now you have two issues:
Create a Person instance from a row in the CSV.
Save a Person instance to the database.
Now, instead of just CSV-to-Database, you have CSV-to-Person and Person-to-Database. I think this is conceptually cleaner. When the admins change the schema, that changes the Person-to-Database side. When the admins change the CSV format, they're changing the CSV-to-Person side. Now you can deal with each separately.
Does that help any?
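A rough sketch of that split (the row fields, app name and Django models below are hypothetical stand-ins for the real schema):
import csv
from dataclasses import dataclass

@dataclass
class PersonRow:
    first_name: str
    last_name: str
    account_kind: str    # e.g. "savings" / "checking"
    account_usage: str   # e.g. "personal" / "business"

def csv_to_rows(path):
    """CSV-to-Person: only knows about the spreadsheet layout."""
    with open(path, newline="") as f:
        for row in csv.reader(f):
            yield PersonRow(*row)

def save_row(row):
    """Person-to-Database: only knows about the Django models."""
    from myapp.models import Account, AccountType, Person  # hypothetical app
    person, _ = Person.objects.get_or_create(
        first_name=row.first_name, last_name=row.last_name
    )
    account_type, _ = AccountType.objects.get_or_create(name=row.account_usage)
    Account.objects.get_or_create(
        owner=person, kind=row.account_kind, account_type=account_type
    )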
I write import sub-systems almost every month at work, and because I do that kind of task so much I wrote django-data-importer some time ago. This importer works like a Django form and has readers for CSV, XLS and XLSX files that give you lists of dicts.
With the data_importer readers you can read a file into lists of dicts, iterate over them with a for loop and save the lines to the DB.
With the importer you can do the same, but with the bonus of validating each field of a line, logging errors and actions, and saving everything at the end.
Please take a look at https://github.com/chronossc/django-data-importer. I'm pretty sure it will solve your problem and help you with processing any kind of CSV file from now on :)
To solve your problem I suggest using data-importer with Celery tasks. You upload the file and fire the import task via a simple interface. The Celery task sends the file to the importer, and you can validate lines, save them, and log errors. With some effort you can even present the progress of the task to the users who uploaded the sheet.
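A minimal sketch of the Celery side of that idea (the task body is generic; import_line is a hypothetical helper that validates a line and writes it to the DB, not part of django-data-importer's API):
import csv
from celery import shared_task

@shared_task
def import_spreadsheet(path):
    """Parse an uploaded CSV in the background and collect per-line errors."""
    errors = []
    with open(path, newline="") as f:
        for lineno, row in enumerate(csv.reader(f), start=1):
            try:
                import_line(row)  # hypothetical: validate the line and write it to the DB
            except Exception as exc:
                errors.append((lineno, str(exc)))
    return errors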
I ended up taking a few steps back to address this problem, per Occam's razor, using updatable SQL views. It meant a few sacrifices:
Removing the South/DB-dependent real-time schema administration API, dynamic model loading, and dynamic ORM syncing
Defining models.py and an initial South migration by hand.
This allows for a simple approach to importing flat datasets (CSV/Excel) into a normalized database:
Define unmanaged models in models.py for each spreadsheet
Map those to updatable SQL views (INSERT/UPDATE-INSTEAD SQL RULEs) in the initial South migration that adhere to the spreadsheet field layout
Iterate through the CSV/Excel spreadsheet rows and perform an INSERT INTO <VIEW> (<COLUMNS>) VALUES (<CSV-ROW-FIELDS>); for each one (a sketch of this step follows the list)
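A sketch of that last step, assuming a hypothetical updatable view named person_account_flat whose columns mirror the spreadsheet; connection details and file names are placeholders:
import csv
import psycopg2

conn = psycopg2.connect("dbname=projectdb user=app")
with conn, conn.cursor() as cur, open("accounts.csv", newline="") as f:
    for row in csv.reader(f):
        # the INSTEAD rules on the view fan this flat row out to the normalized tables
        cur.execute(
            "INSERT INTO person_account_flat "
            "(first_name, last_name, account_kind, account_usage) "
            "VALUES (%s, %s, %s, %s)",
            row,
        )
conn.close()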
Here is another approach that I found on GitHub. Basically it detects the schema and allows overrides. Its whole goal is just to generate raw SQL to be executed by psql and/or whatever driver.
https://github.com/nmccready/csv2psql
% python setup.py install
% csv2psql --schema=public --key=student_id,class_id example/enrolled.csv > enrolled.sql
% psql -f enrolled.sql
There are also a bunch of options for doing ALTERs (creating primary keys from multiple existing columns) and merging/dumps.
