I have 2 pipelines:
1. Uploading data from blob storage to the datastore (to update the data asset)
2. Training the model
I also have 2 schedules:
1. Triggers every 10 minutes to run pipeline 1
2. Triggers when there is new data in the datastore
I used this notebook as a starter code and modified it to fit my needs.
Code for Schedule 1
from azureml.pipeline.core import Schedule, ScheduleRecurrence

schedule = Schedule.create(
    workspace=ws,
    name="data_ingestion_schedule",
    description="Schedule that updates the Dataset",
    pipeline_parameters={"ds_name": DATASET_NAME},
    pipeline_id=published_pipeline.id,  # id of the published data upload pipeline
    experiment_name=EXPERIMENT_NAME,
    recurrence=ScheduleRecurrence(frequency="Minute", interval=10),
    wait_for_provisioning=True,
)
Code for Schedule 2
schedule = Schedule.create(
    workspace=ws,
    name="retraining_schedule",
    description="Schedule that triggers training pipeline",
    pipeline_parameters={"ds_name": DATASET_NAME, "model_name": MODEL_NAME},
    pipeline_id=published_pipeline.id,  # id of the published training pipeline
    experiment_name=EXPERIMENT_NAME,
    datastore=dstor,
    path_on_datastore=DATASET_NAME,
    wait_for_provisioning=True,
    polling_interval=5,
)
So, the problem is that the second schedule does not get triggered even when there is a change in the data in the datastore (at this path: dstor/DATASET_NAME). I waited and tried uploading multiple files, but the second schedule never gets triggered.
I have tried this in different workspaces, with different datastores and paths on the datastore, but did not succeed. I have searched on Google but haven't found anything.
Ideally, I want to be able to upload new data to blob storage -> pipeline 1 (triggered by schedule 1) takes the data from the storage and uploads it to the datastore -> pipeline 2 (triggered by schedule 2) takes the updated dataset and (re-)trains the model.
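For reference, the upload step inside pipeline 1 writes into the watched path roughly like this (a sketch only; dstor and DATASET_NAME are the same objects used in the schedules above, and the file list is a placeholder):

# Hypothetical upload step of pipeline 1: copy new files into the
# datastore path that schedule 2 is watching (path_on_datastore=DATASET_NAME).
dstor.upload_files(
    files=["./staging/new_data.csv"],  # placeholder for the files pulled from blob storage
    target_path=DATASET_NAME,          # same path the retraining schedule polls
    overwrite=True,
)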
I have a Python-based microservice that uses the Google Cloud Monitoring Python SDK (monitoring_v3) to create and record custom metrics; the code is shown below.
import uuid

from google.api import label_pb2 as ga_label
from google.api import metric_pb2 as ga_metric
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = f"projects/{project_id}"

descriptor = ga_metric.MetricDescriptor()
descriptor.type = "custom.googleapis.com/my_metric" + str(uuid.uuid4())
descriptor.metric_kind = ga_metric.MetricDescriptor.MetricKind.GAUGE
descriptor.value_type = ga_metric.MetricDescriptor.ValueType.DOUBLE
descriptor.description = "Custom Metric recording specific code level events."
# Register the descriptor with Cloud Monitoring.
descriptor = client.create_metric_descriptor(
    name=project_name, metric_descriptor=descriptor
)
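For context, recording a value for such a custom metric then looks roughly like the following (a minimal sketch; project_id is assumed to be defined and the metric type must match the descriptor created above):

import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = f"projects/{project_id}"

# Build one time series with a single gauge data point.
series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/my_metric"  # must match descriptor.type
series.resource.type = "global"
series.resource.labels["project_id"] = project_id

now = time.time()
seconds = int(now)
nanos = int((now - seconds) * 10**9)
interval = monitoring_v3.TimeInterval({"end_time": {"seconds": seconds, "nanos": nanos}})
point = monitoring_v3.Point({"interval": interval, "value": {"double_value": 3.14}})
series.points = [point]

client.create_time_series(name=project_name, time_series=[series])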
Example logs are shown below. Sensitive data has been redacted.
How can I create a log-based metric to count the number of log entries that match a given filter?
The current Google Cloud documentation is complex and doesn't clearly address my application's requirements. I've gone through this article published by Google but haven't been able to get the correct metrics created.
Unfortunately, only data received after the user-defined metric has been created will be included.
The data for a user-defined log-based metric comes only from log entries received after the metric is created. A metric isn't retroactively populated with data from log entries that are already in Logging.
https://cloud.google.com/logging/docs/logs-based-metrics
For existing log entries, they should be kept for 30 days by default, so you could look at importing those into BigQuery and analysing them with SQL. You could also set up a log sink for future log entries.
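For the counter metric itself, a minimal sketch with the google-cloud-logging client might look like this (the metric name and filter below are placeholders for your own values):

import google.cloud.logging

client = google.cloud.logging.Client()

# A log-based counter metric counts every entry matching the filter,
# starting from the moment the metric is created (not retroactively).
metric = client.metric(
    "my_event_count",  # placeholder metric name
    filter_='resource.type="gce_instance" AND severity>=ERROR',  # placeholder filter
    description="Counts log entries matching the filter",
)
if not metric.exists():
    metric.create()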
I have a data processing pipeline that consists of API Gateway endpoint > Lambda handler > Kinesis Firehose delivery stream > Lambda transform function > Datadog.
A request to my endpoint triggers around 160k records to be generated for processing (these are spread across 11 different delivery streams, with exponential backoff on the Direct PUT into the delivery stream).
My pipeline is consistently losing around ~20k records (140k of the 160k show up in Datadog). I have confirmed through the metric aws.firehose.incoming_records that all 160k records are being submitted to the delivery stream.
Looking at the transform function's metrics, it shows no errors. I have error logging in the function itself, which is not revealing any obvious issues.
In the destination error log details for the Firehose, I do see the following:
{
    "deliveryStreamARN": "arn:aws:firehose:us-east-1:837515578404:deliverystream/PUT-DOG-6bnC7",
    "destination": "aws-kinesis-http-intake.logs.datadoghq.com...",
    "deliveryStreamVersionId": 9,
    "message": "The Lambda function was successfully invoked but it returned an error result.",
    "errorCode": "Lambda.FunctionError",
    "processor": "arn:aws:lambda:us-east-1:837515578404:function:prob_weighted_calculations-dev:$LATEST"
}
In addition, there are records in my S3 bucket for failed deliveries. I re-ran the failed records through the transform function (I created a custom test event based on the data in the S3 bucket for the failed delivery), and the Lambda executed without an issue.
I read that a mismatch between the number of records sent to the transform function and the number output by the transform could cause the above error log. So I put explicit error checking in the transform that raises an error inside the function if the output record count does not match the input. Even with this, there are no errors in the Lambda.
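For reference, the record-count contract referred to above means the transform has to return exactly one output record per input record, echoing the same recordId. A minimal pass-through sketch of that shape (the actual transformation logic is a placeholder):

import base64
import json

def lambda_handler(event, context):
    output = []
    for record in event["records"]:
        payload = base64.b64decode(record["data"]).decode("utf-8")
        transformed = json.dumps(json.loads(payload))  # placeholder: no-op re-serialize
        output.append({
            "recordId": record["recordId"],  # must echo the input recordId
            "result": "Ok",                  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(transformed.encode("utf-8")).decode("utf-8"),
        })
    # Firehose flags the batch with Lambda.FunctionError if the count or any
    # recordId does not line up, even when the function itself raises no exception.
    return {"records": output}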
I am at a loss here as to what could be causing this and do not feel confident in my pipeline given ~20k records are "leaking" without explanation.
Any suggestion on where to look to continue troubleshooting this issue would be greatly appreciated!
Hi, I would like to split a large BigQuery table (10 billion event records) into multiple tables based on the event_type in the large table.
Note that the events table is partitioned by day on event_time. Further assume that it has a year of data (365 days).
Let's assume event_type = ['sign-up', 'page-view'].
My approach:
1. Create a new table for each event type
2. Run an insert job for each event type for each day [I will also be using DML inside a Python script]
My questions:
1. What job type should I use: a copy job or a load job?
2. Can I queue the jobs to Google BigQuery? [Would they run asynchronously?]
3. Would Google BigQuery process these jobs in parallel?
4. Is there anything I need to do in terms of multiprocessing in order to speed up the process? [The job is handled by BigQuery; if I can queue the jobs then I don't need to do any multiprocessing on the client side.]
Any pointers to an efficient solution are highly appreciated.
You can use query jobs for your requirement. Load jobs are used to ingest data into BigQuery from GCS buckets or local files.
The quotas and limits for query jobs can be found here. These quotas and limits apply to query jobs created automatically by running interactive queries, scheduled queries, and jobs submitted by using the jobs.query and query-type jobs.insert API methods. In a project, up to 300 concurrent API requests per second per user can be made.
The query jobs that use the jobs.insert method will be executed asynchronously. The same can be achieved using the Python client library (as you intended) as shown below. Please refer to this doc for more information.
from google.cloud import bigquery
# Construct a BigQuery client object.
client = bigquery.Client()
# TODO(developer): Set table_id to the ID of the destination table.
# table_id = "your-project.your_dataset.your_table_name"
job_config = bigquery.QueryJobConfig(destination=table_id)
sql = """
SELECT corpus
FROM `bigquery-public-data.samples.shakespeare`
GROUP BY corpus;
"""
# Start the query, passing in the extra configuration.
query_job = client.query(sql, job_config=job_config) # Make an API request.
query_job.result() # Wait for the job to complete.
print("Query results loaded to the table {}".format(table_id))
Since the jobs will be running concurrently, there is no need to implement explicit multiprocessing.
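Applied to your splitting use case, a sketch of submitting one query job per event type and only waiting once all jobs have been submitted might look like this (the project, dataset, and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

source_table = "your-project.your_dataset.events"  # placeholder
event_types = ["sign-up", "page-view"]

jobs = []
for event_type in event_types:
    destination = f"your-project.your_dataset.events_{event_type.replace('-', '_')}"  # placeholder
    job_config = bigquery.QueryJobConfig(
        destination=destination,
        write_disposition="WRITE_APPEND",
        query_parameters=[
            bigquery.ScalarQueryParameter("event_type", "STRING", event_type)
        ],
    )
    sql = f"SELECT * FROM `{source_table}` WHERE event_type = @event_type"
    # client.query() submits the job and returns immediately; BigQuery runs
    # the submitted jobs concurrently on the server side.
    jobs.append(client.query(sql, job_config=job_config))

# Block only at the end, after every job has been submitted.
for job in jobs:
    job.result()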
I can create BigQuery ML models from the Google BigQuery web UI, but I'm trying to keep all of my code in Python notebooks. Is there any way that I can create the models from the notebook without having to jump out to the web UI? I am able to use the predict function to get model results from the Jupyter notebook.
Thanks.
You don't need to do anything special; just run it as a standalone query.
Create your dataset
Enter the following code to import the BigQuery Python client library and initialize a client. The BigQuery client is used to send and receive messages from the BigQuery API.
from google.cloud import bigquery
client = bigquery.Client(location="US")
Next, you create a BigQuery dataset to store your ML model. Run the following to create your dataset:
dataset = client.create_dataset("bqml_tutorial")
Create your model
Next, you create a logistic regression model using the Google Analytics sample dataset for BigQuery. The model is used to predict whether a website visitor will make a transaction. The standard SQL query uses a CREATE MODEL statement to create and train the model. Standard SQL is the default query syntax for the BigQuery Python client library.
The BigQuery Python client library provides a cell magic, %%bigquery, which runs a SQL query and returns the results as a pandas DataFrame (in a plain Jupyter environment you may first need to load it with %load_ext google.cloud.bigquery).
To run the CREATE MODEL query that creates and trains your model:
%%bigquery
CREATE OR REPLACE MODEL `bqml_tutorial.sample_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20160801' AND '20170630'
The query takes several minutes to complete. After the first iteration is complete, your model (sample_model) appears in the navigation panel of the BigQuery web UI. Because the query uses a CREATE MODEL statement to create a model, you do not see query results. The output is an empty DataFrame.
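If you prefer to avoid the cell magic and stay in plain Python, the same statement can be submitted with client.query() (a sketch using the client created earlier):

# Submit the CREATE MODEL statement as an ordinary query job.
sql = """
CREATE OR REPLACE MODEL `bqml_tutorial.sample_model`
OPTIONS(model_type='logistic_reg') AS
SELECT
  IF(totals.transactions IS NULL, 0, 1) AS label,
  IFNULL(device.operatingSystem, "") AS os,
  device.isMobile AS is_mobile,
  IFNULL(geoNetwork.country, "") AS country,
  IFNULL(totals.pageviews, 0) AS pageviews
FROM
  `bigquery-public-data.google_analytics_sample.ga_sessions_*`
WHERE
  _TABLE_SUFFIX BETWEEN '20160801' AND '20170630'
"""
query_job = client.query(sql)  # starts the training job
query_job.result()             # block until the model has been created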
I wrote a small Django website that reads data from a database and shows that data in a table. (The database is filled by making requests to an external API.)
Now my problem is that I need to make a request to the API every 5 minutes, get the last 5 minutes of data, store it in the database, and at the same time update my table to show the last 5 minutes of data.
I have read about job schedulers but I did not understand how to use one. First of all, is a scheduler such as Celery a good solution for this problem? It would also be helpful if you could guide me on how to approach solving this.
A simple solution I have used in the past is to write a Django custom management command and then have a cron job run that command at whatever interval you would like.
Django commands: https://docs.djangoproject.com/en/1.11/howto/custom-management-commands/
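A minimal sketch of such a command (the fetch_and_store helper and the app/module names are hypothetical placeholders for your own API-client code):

# myapp/management/commands/fetch_api_data.py
from django.core.management.base import BaseCommand

from myapp.services import fetch_and_store  # hypothetical helper that calls the API and saves rows


class Command(BaseCommand):
    help = "Fetch the last 5 minutes of data from the external API and store it."

    def handle(self, *args, **options):
        created = fetch_and_store(minutes=5)
        self.stdout.write(self.style.SUCCESS(f"Stored {created} new rows"))

A crontab entry along the lines of */5 * * * * /path/to/venv/bin/python /path/to/project/manage.py fetch_api_data then runs it every 5 minutes; your page picks up the new rows the next time it reads from the database.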