Creating a BigQuery ML Model From a Jupyter Notebook - python

I can create BigQuery ML models from the Google BigQuery web UI, but I'm trying to keep all of my code in Python notebooks. Is there any way I can create the models from the notebook without having to jump out to the web UI? I am already able to use the predict function to get model results from the Jupyter notebook.
Thanks.

You don't need to do anything special; just run the CREATE MODEL statement as a standalone query from the notebook.
Create your dataset
Enter the following code to import the BigQuery Python client library and initialize a client. The BigQuery client is used to send and receive messages from the BigQuery API.
from google.cloud import bigquery

client = bigquery.Client(location="US")
Next, you create a BigQuery dataset to store your ML model. Run the following to create your dataset:
dataset = client.create_dataset("bqml_tutorial")
Create your model
Next, you create a logistic regression model using the Google Analytics sample dataset for BigQuery. The model is used to predict whether a website visitor will make a transaction. The standard SQL query uses a CREATE MODEL statement to create and train the model. Standard SQL is the default query syntax for the BigQuery Python client library.
The BigQuery Python client library provides a cell magic, %%bigquery, which runs a SQL query and returns the results as a pandas DataFrame.
To run the CREATE MODEL query to create and train your model:
%%bigquery
CREATE OR REPLACE MODEL `bqml_tutorial.sample_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20160801' AND '20170630'
The query takes several minutes to complete. After the first iteration is complete, your model (sample_model) appears in the navigation panel of the BigQuery web UI. Because the query uses a CREATE MODEL statement to create a model, you do not see query results. The output is an empty DataFrame.
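If you want to confirm from the notebook (rather than from the web UI) that the model was created, one option, sketched here assuming the client created above is still in scope, is to fetch the model's metadata:
model = client.get_model("bqml_tutorial.sample_model")
print(model.model_type, model.created)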

Related

How can I create custom metrics in Cloud Monitoring using existing log entries?

I have a Python-based microservice where the Cloud Monitoring API Python SDK is used to create and record custom metrics; the code is shown below.
import uuid

from google.api import label_pb2 as ga_label
from google.api import metric_pb2 as ga_metric
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = f"projects/{project_id}"  # project_id is defined elsewhere in the service

# Describe the custom metric: a GAUGE of DOUBLE values with a unique type name
descriptor = ga_metric.MetricDescriptor()
descriptor.type = "custom.googleapis.com/my_metric" + str(uuid.uuid4())
descriptor.metric_kind = ga_metric.MetricDescriptor.MetricKind.GAUGE
descriptor.value_type = ga_metric.MetricDescriptor.ValueType.DOUBLE
descriptor.description = "Custom Metric recording specific code level events."

# Register the descriptor with Cloud Monitoring (standard follow-up call)
descriptor = client.create_metric_descriptor(name=project_name, metric_descriptor=descriptor)
(Example logs omitted; sensitive data has been redacted.)
How can I create a log-based metric to count the number of log entries that match a given filter?
The current Google Cloud documentation is complex and doesn't clearly address my requirements. I've gone through this article published by Google, but I haven't been able to get the correct metrics created.
Unfortunately, only data received after a user-defined metric has been created will be included. From the documentation:
"The data for a user-defined log-based metric comes only from log entries received after the metric is created. A metric isn't retroactively populated with data from log entries that are already in Logging."
https://cloud.google.com/logging/docs/logs-based-metrics
Existing log entries are kept for 30 days by default, so you could import those into BigQuery and analyse them with SQL. You could also set up a log sink for future log entries.
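As a rough sketch of both ideas with the google-cloud-logging client (the metric name, sink name, filter, project, and dataset below are placeholders, not values from the question):
from google.cloud import logging

client = logging.Client()
log_filter = "severity>=ERROR"  # placeholder filter

# Counter log-based metric: counts every future entry matching the filter
metric = client.metric(
    "my_error_count",
    filter_=log_filter,
    description="Counts log entries matching the filter.",
)
if not metric.exists():
    metric.create()

# Sink: routes future matching entries to a BigQuery dataset for SQL analysis
sink = client.sink(
    "errors-to-bigquery",
    filter_=log_filter,
    destination="bigquery.googleapis.com/projects/my-project/datasets/my_logs",
)
if not sink.exists():
    sink.create()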

Getting data from Cassandra tables to MongoDB/RDBMS in realtime

I have an application which uses Cassandra as its database. I need to create some reports from the Cassandra data, but the data is not modelled for the report queries, so the data for one report may be scattered across multiple tables. Since Cassandra doesn't allow joins like an RDBMS, this is not simple to do. I'm therefore thinking of getting the required tables' data into some other DB (RDBMS or Mongo) in real time and then generating the reports from there. Is there any standard way to get data from Cassandra to other DBs (Mongo or RDBMS) in real time, i.e. whenever an insert/update/delete happens in Cassandra, the same is applied to the destination DB? Any example program or code would be very helpful.
You would be better off using Spark together with the Spark Cassandra connector for this task. With Spark you can do the joins in memory and write the data back to Cassandra, to files, or to another store.
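A minimal PySpark sketch of that approach, assuming the Spark Cassandra connector package is on the classpath and using placeholder keyspace, table, and host names:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cassandra-report")
    .config("spark.cassandra.connection.host", "cassandra-host")
    .getOrCreate()
)

def read_table(keyspace, table):
    # Reads one Cassandra table into a Spark DataFrame via the connector
    return (
        spark.read.format("org.apache.spark.sql.cassandra")
        .options(keyspace=keyspace, table=table)
        .load()
    )

orders = read_table("shop", "orders")
customers = read_table("shop", "customers")

# Join in memory, then write the report wherever your reporting tool reads it
report = orders.join(customers, "customer_id")
report.write.mode("overwrite").csv("/tmp/report")
With the respective Spark connectors you could write the joined result to MongoDB or, via JDBC, to an RDBMS instead of a file.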

Is there any way we can load BigTable data into BigQuery?

I want to load Bigtable data into BigQuery directly.
Until now I have been exporting Bigtable data to a CSV file using Python and then loading that CSV file into BigQuery.
Is there a direct way that doesn't need a CSV file in between Bigtable and BigQuery?
To add to Mikhail's recommendation, I'd suggest creating a permanent table in BigQuery from the external table: define the schema for the columns you want, query the rows you're interested in, and save the result. Once that data is saved into BigQuery, querying it won't have any impact on your Bigtable performance. If you want to get the latest data, you can create a new permanent table with the same query.
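As an illustration of that approach (assuming a Bigtable-backed external table already exists; all table and project names below are placeholders), you can query the external table and write the result into a native BigQuery table with the Python client:
from google.cloud import bigquery

client = bigquery.Client()

# Destination is a native BigQuery table; the source is the external table
destination = bigquery.TableReference.from_string("my-project.my_dataset.bigtable_snapshot")
job_config = bigquery.QueryJobConfig(
    destination=destination,
    write_disposition="WRITE_TRUNCATE",
)

query = "SELECT * FROM `my-project.my_dataset.bigtable_external_table`"
client.query(query, job_config=job_config).result()  # waits for the job to finish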
If you're looking to have the data copied over and stored in BigQuery, Querying Cloud Bigtable data using permanent external tables is not what you're looking for. It explicitly mentions that "The data is not stored in the BigQuery table". My understanding is that the permanent table is more for persistent access controls, but still queries Bigtable directly.
This may be overkill, but you could set up an Apache Beam pipeline that runs in Dataflow, with a BigtableIO source and a BigQueryIO sink. You'd have to write a little bit of transformation logic, but overall it should be a pretty simple pipeline. The only catch is that the BigtableIO connector is only available for the Beam Java SDK, so you'd have to write this pipeline in Java.

Google BigQuery Results Don't Show

I created a Python script that pushes a pandas dataframe into Google BigQuery, and it looks as though I'm able to query the table directly from GBQ. However, another user is unable to view the results when they query the same table I generated on GBQ. This seems to be a BigQuery issue, because when they connected to GBQ and queried the table indirectly using pandas, it worked fine (pd.read_gbq("SELECT * FROM ...", project_id)). What is causing this strange behaviour?
What I'm seeing vs. what they are seeing: (screenshots omitted)
I've encountered this when loading tables to BigQuery via Python GBQ. If you take the following steps, the table will display properly:
1) Load the dataframe to BigQuery via Python GBQ (a minimal sketch follows below)
2) Run SELECT * FROM uploaded_dataset.uploaded_dataset; doing so will properly show the table
3) Within the BigQuery UI, save the table (as a new table name)
From there, you will be able to see the table properly. Unfortunately, I don't know how to resolve this without a manual step in the UI.
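For completeness, a hedged sketch of the load/read round trip with pandas-gbq (dataset, table, and project names are placeholders, not from the original question):
import pandas as pd
import pandas_gbq

df = pd.DataFrame({"name": ["a", "b"], "value": [1, 2]})

# Step 1: load the dataframe into BigQuery
pandas_gbq.to_gbq(
    df,
    "uploaded_dataset.uploaded_table",
    project_id="my-project",
    if_exists="replace",
)

# The indirect query that worked for the other user
df_check = pandas_gbq.read_gbq(
    "SELECT * FROM uploaded_dataset.uploaded_table",
    project_id="my-project",
)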

How to serve a Prophet model for my Django application?

I created a model with Facebook Prophet, and I'm now wondering what the "best" way is to access its predictions from an online web application (Django).
The requirements are that I have to train/update my model on a weekly basis with data from my Django application (PostgreSQL). The predictions will be saved, and I want to be able to call/access this data from my Django application.
After looking into Google Cloud and AWS, I couldn't find any solution that hosts my model in a way that lets me simply access predictions via an API.
My best idea/approach to solve that right now:
1) Build a Flask application that trains my models on a weekly basis. Predictions are saved in PostgreSQL. The data will be a weekly CSV export from my Django web application.
2) Create an API in my Flask application, that can access predictions from the database.
3) From my Django application, I can call the API and access the data whenever needed.
I am pretty sure my approach sounds bumpy and is probably not how this is usually done. Do you have any feedback or ideas on how to solve it better? In short:
1) Predict data from a PostgresSQL database.
2) Serve predictions in a Django web application.
The simplest way to serve pre-calculated forecast values from Prophet is to serve CSV files from S3 or another file server. You can refresh your models every few days and write the forecast output to S3:
import boto3
from io import StringIO

DESTINATION = bucket_name

def write_dataframe_to_csv_on_s3(dataframe, filename):
    """Write a dataframe to a CSV on S3."""
    print("Writing {} records to {}".format(len(dataframe), filename))
    # Create buffer
    csv_buffer = StringIO()
    # Write dataframe to buffer
    dataframe.to_csv(csv_buffer, sep=",", index=False)
    # Create S3 object
    s3_resource = boto3.resource("s3")
    # Write buffer to S3 object
    s3_resource.Object(DESTINATION, filename).put(Body=csv_buffer.getvalue())

results = forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].copy()
write_dataframe_to_csv_on_s3(results, output + file_name + ".csv")
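And a hedged companion sketch for the Django side, reading the forecast CSV back from S3 (bucket and key names are placeholders):
import io

import boto3
import pandas as pd

def load_forecast(bucket, key):
    # Fetch the CSV object and parse it into a DataFrame
    s3 = boto3.client("s3")
    obj = s3.get_object(Bucket=bucket, Key=key)
    return pd.read_csv(io.BytesIO(obj["Body"].read()))

forecast_df = load_forecast("my-forecast-bucket", "forecast.csv")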
One of the reasons I visited this question was that I wasn't sure which way to go. The answer above seems like a great alternative.
However, I didn't have many constraints on my Django application, so I worked out a simpler approach for use cases similar to mine.
My solution:
I have a Django project, a Django app that serves my website, and a Django app for the Prophet model.
The Prophet model will be re-trained each day exactly once (after some condition).
Each day the model is trained, it predicts for the new data and saves the predictions to a CSV file (which can be stored in a database). It also stores the trained model using pickle.
Now I have access to the trained model and some pre-computed predictions by importing that Django app wherever I need it.
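A minimal sketch of that daily retrain-and-save step, assuming the current prophet package (older installs import fbprophet) and placeholder file paths; the history DataFrame must already have Prophet's expected 'ds' and 'y' columns:
import pickle

import pandas as pd
from prophet import Prophet

def retrain_and_forecast(history: pd.DataFrame, periods: int = 30) -> None:
    model = Prophet()
    model.fit(history)

    # Predict over the history plus `periods` future days
    future = model.make_future_dataframe(periods=periods)
    forecast = model.predict(future)

    # Persist predictions and the fitted model for the website app to read
    forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].to_csv(
        "predictions.csv", index=False
    )
    with open("prophet_model.pkl", "wb") as f:
        pickle.dump(model, f)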
The project hierarchy:
project/
    project/
    django-app-for-website/
    django-app-for-prophet/
    manage.py
    requirements.txt
The performance of my project isn't affected much, and it isn't my priority for now; if performance is a priority for you, I wouldn't recommend this solution.
If you're looking for the simplest way to serve a Prophet model, this is what I could come up with. Just another possible solution.
