Say I would like to scrape JSON data from a URL that refreshes every 2 minutes. I need to run my pipeline (written in Python) every 2 minutes so that no refresh is missed. In this sense the pipeline is real-time processing.
Currently I'm using Jenkins to run the pipeline every 2 minutes, but I don't think this is correct practice: Jenkins is meant for CI/CD pipelines, and I doubt mine is one. Even though I know there is a Jenkins Pipeline plugin, I still think using it here is conceptually incorrect.
So, what are the best data processing tools in this case? At this point no data transformation is needed, but I expect transformation will be required as the process gets more complex. Just FYI, the data will be pumped into Azure Blob Storage.
Jenkins is meant for CI/CD pipelines and not for scheduling jobs. So yes, you are right, Jenkins isn't the right tool for what you want to achieve.
I suppose you are using Microsoft Azure.
If you are using Databricks, write a Spark Streaming application and create a job for it.
https://learn.microsoft.com/en-us/azure/databricks/dev-tools/api/latest/jobs
The other approach is to use Azure Data Factory with a schedule trigger for the job. As you want to run every 2 minutes, I am not sure what the cost will be.
As mentioned, Jenkins probably should not be used as a distributed cron system... Let it do what it's good at: CI.
You can boot any VM and create a cron script, or you can use a scheduler like Airflow.
FWIW, "every 2 minutes" is not "real time"
Just FYI, the data will pump into azure blob storage.
If you are set on using Kafka (via Event Hub), you could use Kafka Connect to scrape the site, send data into a Kafka topic, then use a sink connector to send data to WASB
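For the plain cron-style route mentioned above, here is a minimal sketch of a polling loop in Python. It is only an illustration under assumptions: `store` is a hypothetical callable you would supply (for example, a thin wrapper around whatever Azure Blob Storage upload call you use), the `snapshots/` naming scheme is made up, and runs are aligned to 2-minute wall-clock boundaries so a slow fetch does not make the schedule drift.

```python
import json
import time
from datetime import datetime, timezone
from urllib.request import urlopen

INTERVAL_SECONDS = 120  # the source refreshes every 2 minutes


def next_run(now, interval=INTERVAL_SECONDS):
    """Return the next interval boundary strictly after `now` (epoch seconds)."""
    return (int(now) // interval + 1) * interval


def blob_name(ts):
    """Timestamped name so that every snapshot is stored separately."""
    stamp = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"snapshots/{stamp}.json"


def fetch_json(url, timeout=30.0):
    """Download and parse the JSON document at `url`."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def run_forever(url, store, fetch=fetch_json, sleep=time.sleep, clock=time.time):
    """Fetch `url` at every boundary and hand the payload to `store(name, text)`.

    `store` is a hypothetical callable supplied by the caller.
    """
    while True:
        target = next_run(clock())
        sleep(max(0.0, target - clock()))
        try:
            store(blob_name(target), json.dumps(fetch(url)))
        except Exception as exc:  # keep the loop alive on a failed fetch or upload
            print(f"run at {target} failed: {exc}")
```

Computing the next boundary from the clock, rather than sleeping a fixed 120 seconds after each run, is what keeps the runs from slowly sliding past a refresh.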
I created a scraper in Python that navigates a website. It pulls many links, and then it has to visit every link, pull the data, parse it, and store the result.
Is there an easy way to run that script distributed in the cloud (like AWS)?
Ideally, I would like something like this (probably is more difficult, but just to give an idea)
run_in_the_cloud --number-of-instances 5 scraper.py
After the process is done, the instances are killed, so it does not cost more money.
I remember doing something similar with Hadoop MapReduce in Java a long time ago.
If you can put your scraper in a docker image it's relatively trivial to run and scale dockerized applications using AWS ECS Fargate. Just create a task definition and point it at your container registry, then submit runTask requests for however many instances you want. AWS Batch is another tool you could use to trivially parallelize container instances too.
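To give an idea of what submitting those runTask requests could look like from Python, here is a rough sketch. It assumes you pass in a boto3 ECS client (`boto3.client("ecs")`) and that the cluster, task definition, and subnets already exist; the function name `launch_scrapers` and all argument values are hypothetical. Since a single `run_task` call accepts at most 10 tasks, larger totals are split into batches.

```python
def launch_scrapers(ecs_client, cluster, task_definition, subnets, total, batch_size=10):
    """Start `total` copies of a Fargate task, at most `batch_size` per run_task call."""
    task_arns = []
    remaining = total
    while remaining > 0:
        count = min(batch_size, remaining)
        response = ecs_client.run_task(
            cluster=cluster,
            taskDefinition=task_definition,
            launchType="FARGATE",
            count=count,
            networkConfiguration={
                "awsvpcConfiguration": {
                    "subnets": subnets,
                    "assignPublicIp": "ENABLED",
                }
            },
        )
        # Collect the ARNs so the tasks can be monitored afterwards.
        task_arns.extend(task["taskArn"] for task in response.get("tasks", []))
        remaining -= count
    return task_arns


# Hypothetical usage (names are placeholders, boto3 must be installed):
#   import boto3
#   launch_scrapers(boto3.client("ecs"), "scraper-cluster", "scraper-task",
#                   ["<your-subnet-id>"], total=5)
```

Each task stops when the container's process exits, so you only pay while the scraper is actually running.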
I want to run one of my Python scripts on GCP. I am fairly new to GCP, so I don't know much about it.
My Python script grabs data from BigQuery and performs these tasks:
Several data processing operations
Build a ML model using KDTree and few clustering algorithms
Dumping the final result to a BigQuery table.
This script needs to run every night.
So far I know I can use VMs, Cloud Run, or Cloud Functions (not a good option for me, as it will take about an hour to finish everything). What would be the best choice for running this?
I came across Dataflow, but I am curious to know if it's possible to run a custom python script that can do all these things in google cloud dataflow (assuming I will have to convert everything into map-reduce format that doesn't seem easy with my code especially the ML part)?
Do you just need a python script to run on a single instance for a couple hours and then terminate?
You could set up a 'basic scaling' App Engine micro-service within your GCP project. The max run-time for taskqueue tasks is 24 hours when using 'basic scaling'.
Requests can run for up to 24 hours. A basic-scaled instance can choose to handle /_ah/start and execute a program or script for many hours without returning an HTTP response code. Task queue tasks can run up to 24 hours.
https://cloud.google.com/appengine/docs/standard/python/how-instances-are-managed
I designed a beam / dataflow pipeline using the beam python library. The pipeline roughly does the following:
ParDo: Collect JSON data from an API
ParDo: Transform JSON data
I/O: Write transformed data to BigQuery Table
Generally, the code does what it is supposed to do. However, when collecting a big dataset from the API (around 500,000 JSON files), the BigQuery insert job stops right after it has been started (within one second) without a specific error message when using the DataflowRunner (it works with the DirectRunner executed on my computer). With a smaller dataset, everything works just fine.
Dataflow log is as follows:
2019-04-22 (00:41:29) Executing BigQuery import job "dataflow_job_14675275193414385105". You can check its status with the bq tool: "bq show -j --project_id=X dataflow_job_14675275193414385105".
2019-04-22 (00:41:29) Workflow failed. Causes: S01:Create Dummy Element/Read+Call API+Transform JSON+Write to Bigquery /WriteToBigQuery/NativeWrite failed., A work item was attempted 4 times without success. Each time the worker eventually lost contact with the service. The work item was attempted on:
beamapp-X-04212005-04211305-sf4k-harness-lqjg,
beamapp-X-04212005-04211305-sf4k-harness-lgg2,
beamapp-X-04212005-04211305-sf4k-harness-qn55,
beamapp-X-04212005-04211305-sf4k-harness-hcsn
Using the bq cli tool as suggested to get more information about the BQ load job does not work. The job cannot be found (and I doubt that it has been created at all due to instant failure).
I suppose I ran into some kind of quota / BigQuery restriction or even an out-of-memory issue (see: https://beam.apache.org/documentation/io/built-in/google-bigquery/)
Limitations
BigQueryIO currently has the following limitations.
You can’t sequence the completion of a BigQuery write with other steps of your pipeline.
If you are using the Beam SDK for Python, you might have import size quota issues if you write a very large dataset. As a workaround, you can partition the dataset (for example, using Beam’s Partition transform) and write to multiple BigQuery tables. The Beam SDK for Java does not have this limitation as it partitions your dataset for you.
I'd appreciate any hint on how to narrow down the root cause for this issue.
I'd also like to try out a partition function, but I did not find any Python source code examples showing how to write a partitioned PCollection to BigQuery tables.
One thing that might help the debugging is looking at the Stackdriver logs.
If you pull up the Dataflow job in the Google console and click on LOGS in the top right corner of the graph panel, that should open the logs panel at the bottom. The top right of the LOGS panel has a link to Stackdriver. This will give you a lot of logging information about your workers/shuffles/etc. for this particular job.
There's a lot in it, and it can be hard to filter out what's relevant, but hopefully you're able to find something more helpful than "A work item was attempted 4 times without success". For instance, each worker occasionally logs how much memory it is using, which can be compared to the amount of memory each worker has (based on the machine type) to see whether the workers are indeed running out of memory or whether your error is happening elsewhere.
Good luck!
As far as I know, there is no available option to diagnose OOM in Cloud Dataflow with Apache Beam's Python SDK (it is possible with the Java SDK). I recommend opening a feature request in the Cloud Dataflow issue tracker to get more details for this kind of issue.
In addition to checking the Dataflow job log files, I recommend monitoring your pipeline with the Stackdriver Monitoring tool, which provides resource usage per job (such as the total memory usage time).
Regarding the use of the Partition function in the Python SDK, the following code (based on the sample provided in Apache Beam's documentation) splits the data into 3 BigQuery load jobs:
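import apache_beam as beam

def partition_fn(element, num_partitions):
    # Called once per element. get_percentile is your own helper returning a
    # value in [0, 100]; the min() keeps a value of 100 inside the valid range.
    return min(int(get_percentile(element) * num_partitions / 100),
               num_partitions - 1)

# input_data, table_spec and table_schema come from your existing pipeline.
partition = input_data | beam.Partition(partition_fn, 3)
for x in range(3):
    partition[x] | 'WritePartition %s' % x >> beam.io.WriteToBigQuery(
        table_spec,
        schema=table_schema,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)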
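If your records have no natural percentile to partition on, one possible alternative (my assumption, not something from the Beam documentation sample) is to partition on a stable hash of each record. The sketch below assumes every element is a JSON-serializable dict; the name `hash_partition_fn` is hypothetical. It uses `zlib.crc32` rather than Python's built-in `hash`, because the built-in hash of strings can differ between worker processes.

```python
import json
import zlib


def hash_partition_fn(element, num_partitions):
    """Map a JSON-serializable element to a stable partition index."""
    # sort_keys makes the serialization independent of dict key order.
    payload = json.dumps(element, sort_keys=True).encode("utf-8")
    return zlib.crc32(payload) % num_partitions
```

It has the (element, num_partitions) signature that beam.Partition expects, so it could be dropped in as beam.Partition(hash_partition_fn, 3) in the code above.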
I am developing a reporting service (i.e. Database reports via email) for a project on Google App Engine, naturally using the Google Cloud Platform.
I am using Python and Django, but I feel that may be unimportant to my question specifically. I want to allow users of my application to schedule specific cron reports to be sent at specified times of the day.
I know this is completely possible by running a cron on GAE on a minute-by-minute basis (using cron.yaml since I'm using Python) and providing the logic to determine which reports to run in whatever view I decide to make the cron hit, but this seems terribly inefficient to me, and seeing as the best answer I have found suggests doing the same thing (Adding dynamic cron jobs to GAE), I wanted an "updated" suggestion.
Is there at this point in time a better option than running a cron every minute and checking a DB full of client entries to determine which report to fire off?
You may want to have a look at the new Google Cloud Scheduler service (in beta at the moment), which is a fully managed cron job service. It allows you to create cron jobs programmatically via its REST API, so you could create a specific cron job per customer with the appropriate schedule to fit your needs.
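As a rough illustration of the per-customer idea, the sketch below builds the JSON body for one Cloud Scheduler job with an HTTP target. The field names (`name`, `schedule`, `timeZone`, `httpTarget`) follow the Job resource of the REST API as I understand it; the helper names, the job-id scheme, and the target URL are hypothetical, and actually sending the request (authentication, HTTP client) is left out.

```python
def daily_cron(hhmm):
    """Turn a 'HH:MM' time of day into a unix-cron expression that fires daily."""
    hour, minute = (int(part) for part in hhmm.split(":"))
    if not (0 <= hour < 24 and 0 <= minute < 60):
        raise ValueError("invalid time of day: %r" % hhmm)
    return "%d %d * * *" % (minute, hour)


def build_scheduler_job(project, location, customer_id, hhmm, target_url,
                        time_zone="Etc/UTC"):
    """Build the request body for one per-customer report job (hypothetical naming)."""
    job_id = "report-%s" % customer_id
    return {
        "name": "projects/%s/locations/%s/jobs/%s" % (project, location, job_id),
        "schedule": daily_cron(hhmm),
        "timeZone": time_zone,
        "httpTarget": {"uri": target_url, "httpMethod": "POST"},
    }
```

The resulting dict would be posted to the jobs collection of the chosen project and location; whenever a customer changes their schedule, you would update or recreate the corresponding job.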
Given this limit, my guess would be NO
Free applications can have up to 20 scheduled tasks. Paid applications can have up to 250 scheduled tasks.
https://cloud.google.com/appengine/docs/standard/python/config/cronref#limits
Another version of your minute-by-minute workaround would be a daily cron task that finds every report that should be sent that day, and then uses the _eta argument to pinpoint the precise moment in the day at which each task should launch.
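A minimal sketch of that daily task, under some assumptions: schedules are stored as "HH:MM" strings keyed by report id, times are naive and treated as UTC, and `enqueue` is a hypothetical callable that you supply. On App Engine it could wrap something like deferred.defer(send_report, report_id, _eta=eta), where send_report is your own function.

```python
from datetime import date, datetime, time


def etas_for_day(schedules, day):
    """Return (eta, report_id) pairs for `day`, earliest first.

    `schedules` maps report_id -> 'HH:MM' (assumed storage format).
    """
    pairs = []
    for report_id, hhmm in schedules.items():
        hour, minute = (int(part) for part in hhmm.split(":"))
        pairs.append((datetime.combine(day, time(hour, minute)), report_id))
    return sorted(pairs)


def enqueue_reports(schedules, day, enqueue):
    """Called once per day by cron: queue one delayed task per report."""
    for eta, report_id in etas_for_day(schedules, day):
        enqueue(report_id, eta)
```

This way cron fires once a day instead of once a minute, and the task queue takes care of launching each report at its own time.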
I'm trying to write a custom Source using the Python Dataflow SDK to read JSON data in parallel from a REST endpoint.
E.g. for a given set of IDs, I need to retrieve the data from:
https://foo.com/api/results/1
https://foo.com/api/results/2
...
https://foo.com/api/results/{maxID}
The key features I need are monitoring and rate limiting: even though I need parallelism (either thread/process-based or using async/coroutines), I also need to make sure my job stays "polite" towards the API endpoint, effectively avoiding an involuntary DDoS.
Using psq, I should be able to implement some kind of rate-limit mechanism, but then I'd lose the ability to monitor progress and ETA using the Dataflow Service Monitoring.
It seems that, although they work well together, monitoring isn't unified between Google Cloud Dataflow and Google Cloud Pub/Sub (which uses Google Stackdriver Monitoring)
How should I go about building a massively-parallel HTTP consumer workflow which implements rate limiting and has web-based monitoring?
Dataflow does not currently have a built-in way of doing global rate-limiting, but you can use the Source API to do this. The key concept is that each split of a Source will be processed by a single thread at most, so you can implement local rate limiting separately for each split.
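To make the per-split idea concrete, here is a small sketch of a local limiter that a split's reader could call before each HTTP request. It is an assumption-laden illustration, not part of the Dataflow SDK: the class name `SplitRateLimiter` is made up, and it assumes you know the number of splits up front, so that giving each split 1/num_splits of the global budget keeps the total at or below the global rate.

```python
import time


class SplitRateLimiter:
    """Enforce a minimum interval between calls made by a single split."""

    def __init__(self, global_rate_per_sec, num_splits,
                 clock=time.monotonic, sleep=time.sleep):
        # Each split gets an equal share of the global request budget.
        self.min_interval = num_splits / float(global_rate_per_sec)
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None

    def wait(self):
        """Block until this split is allowed to issue its next request."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
```

Inside the reader for one split, you would call wait() right before fetching each URL. No locking is needed because a split is processed by at most one thread.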
This solution does not use Pub/Sub at all, so you can exclusively use the Dataflow Monitoring UI. If you want to set up alerts based on particular events in your pipeline, you could do something like this