I'm working with a small company currently that stores all of their app data in an AWS Redshift cluster. I have been tasked with doing some data processing and machine learning on the data in that Redshift cluster.
The first task I need to do requires some basic transforming of existing data in that cluster into some new tables based on some fairly simple SQL logic. In an MSSQL environment, I would simply put all the logic into a parameterized stored procedure and schedule it via SQL Server Agent Jobs. However, sprocs don't appear to be a thing in Redshift. How would I go about creating a SQL job and scheduling it to run nightly (for example) in an AWS environment?
The other task I have involves developing a machine learning model (in Python) and scoring records in that Redshift database. What's the best way to host my python logic and do the data processing if the plan is to pull data from that Redshift cluster, score it, and then insert it into a new table on the same cluster? It seems like I could spin up an EC2 instance, host my python scripts on there, do the processing on there as well, and schedule the scripts to run via cron?
I see tons of AWS (and non-AWS) products that look like they might be relevant (AWS Glue/Data Pipeline/EMR), but there's so many that I'm a little overwhelmed. Thanks in advance for the assistance!
ETL
Amazon Redshift does not support stored procedures. Also, I should point out that stored procedures are generally a bad thing because you are putting logic into a storage layer, which makes it very hard to migrate to other solutions in the future. (I know of many Oracle customers who have locked themselves into never being able to change technologies!)
You should run your ETL logic external to Redshift, simply using Redshift as a database. This could be as simple as running a script that uses psql to call Redshift, such as:
`psql <authentication stuff> -c 'insert into z select a, b from x'`
(Use psql v8, upon which Redshift was based.)
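For a nightly schedule, the same call can be wrapped in a small Python script and run from cron. This is only a minimal sketch: it assumes `psql` is on the PATH and that the password comes from `~/.pgpass` or `PGPASSWORD`. The host, database, user and function names are placeholders, not values from the question.

```python
import subprocess


def build_psql_command(host, dbname, user, sql, port=5439):
    """Return the psql argument list for running one SQL statement."""
    return [
        "psql",
        "-h", host,
        "-p", str(port),  # 5439 is Redshift's default port
        "-d", dbname,
        "-U", user,
        "-c", sql,
    ]


def run_nightly_transform():
    """Run the transform; raises CalledProcessError if psql exits non-zero."""
    cmd = build_psql_command(
        host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
        dbname="dev",        # placeholder
        user="etl_user",     # placeholder
        sql="insert into z select a, b from x",
    )
    subprocess.run(cmd, check=True)
```

A crontab entry such as `0 2 * * * /usr/bin/python3 /opt/etl/nightly_transform.py` (a hypothetical path to a script that calls `run_nightly_transform()`) would then run it at 02:00 every night.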
Alternatively, you could use more sophisticated ETL tools such as AWS Glue (not currently in every Region) or 3rd-party tools such as Bryte.
Machine Learning
Yes, you could run code on an EC2 instance. If the job is small, you could use AWS Lambda (which caps run-time, currently at 15 minutes). Many ML users like using Spark on Amazon EMR. It depends upon the technology stack you require.
Amazon CloudWatch Events can schedule Lambda functions, which could then launch EC2 instances that could do your processing and then self-Terminate.
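As a rough sketch of that pattern, the scheduled Lambda function below launches one EC2 instance whose start-up script runs the scoring job and then shuts down. Because the instance is launched with `InstanceInitiatedShutdownBehavior` set to `terminate`, the shutdown terminates it. The AMI ID, instance type, S3 script location and instance profile name are all hypothetical, and the AMI is assumed to have Python and the AWS CLI installed.

```python
def build_run_instances_params(ami_id, instance_type, script_s3_uri, profile_name):
    """Build the keyword arguments for the EC2 run_instances call."""
    user_data = "\n".join([
        "#!/bin/bash",
        f"aws s3 cp {script_s3_uri} /tmp/score.py",
        "python3 /tmp/score.py",
        "shutdown -h now",  # with 'terminate' below, this ends the instance
    ])
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "UserData": user_data,  # boto3 base64-encodes this for you
        "IamInstanceProfile": {"Name": profile_name},
        "InstanceInitiatedShutdownBehavior": "terminate",
    }


def lambda_handler(event, context):
    """Entry point invoked by the CloudWatch Events schedule."""
    import boto3  # provided by the Lambda Python runtime

    params = build_run_instances_params(
        ami_id="ami-0123456789abcdef0",            # hypothetical
        instance_type="t3.medium",                 # hypothetical
        script_s3_uri="s3://my-bucket/score.py",   # hypothetical
        profile_name="scoring-instance-profile",   # hypothetical
    )
    response = boto3.client("ec2").run_instances(**params)
    return response["Instances"][0]["InstanceId"]
```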
Lots of options, indeed!
The 2 options for running ETL on Redshift
Create some "create table as" type SQL, which will take your source
tables as input and generate your target (transformed table)
Do the transformation outside of the database using an ETL tool. For
example EMR or Glue.
Generally, in an MPP environment such as Redshift, the best practice is to push the ETL to the powerful database (i.e. option 1).
Only consider taking the ETL outside of Redshift (option 2) where SQL is not the ideal tool for the transformation, or the transformation is likely to take a huge amount of compute resource.
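To make option 1 concrete, here is one possible way to generate the statements for a "create table as" refresh. It builds the new table under a temporary name and then swaps it in. The table names and the helper itself are illustrative assumptions, and the statements still have to be executed through whatever client you use (psql, a Python driver, and so on).

```python
def build_ctas_refresh(target, select_sql):
    """Return the SQL statements that rebuild `target` from `select_sql`."""
    staging = f"{target}_new"
    return [
        f"drop table if exists {staging}",
        f"create table {staging} as {select_sql}",
        f"drop table if exists {target}",
        f"alter table {staging} rename to {target}",
    ]


# Example: rebuild a hypothetical daily_totals table from a hypothetical orders table.
statements = build_ctas_refresh(
    "daily_totals",
    "select order_date, sum(amount) as total from orders group by order_date",
)
```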
There is no inbuilt scheduling or orchestration tool. Apache Airflow is a good option if you need something more full featured than cron jobs.
Basic transforming of existing data
It seems you are a Python developer (you mentioned you are developing a Python-based ML model), so you can do the transformation by following the steps below:
You can use boto3 (https://aws.amazon.com/sdk-for-python/), for example
through its Redshift Data API client, to talk to Redshift from any
workstation on your LAN (make sure your IP has the proper privileges)
You can write your own Python functions that mimic stored procedures. Inside these functions, you can put / construct your transformation
logic.
Alternatively, you can create functions using Python in Redshift as well (Python UDFs), which can play a role similar to stored procedures. See more here
(https://aws.amazon.com/blogs/big-data/introduction-to-python-udfs-in-amazon-redshift/)
Finally, you can use Windows Task Scheduler or a cron job to schedule your Python scripts with parameters, as a SQL Server Agent job does
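A Python function that mimics a parameterized stored procedure might look like the sketch below. It takes any DB-API connection plus a run date, and is written so that re-running it for the same date is safe. The `orders` and `daily_totals` tables are invented for illustration, and the demo uses an in-memory SQLite database purely as a stand-in so the example runs anywhere. Against Redshift you would pass a connection from a PostgreSQL driver instead.

```python
import sqlite3


def refresh_daily_totals(conn, run_date):
    """Rebuild one day's rows in daily_totals, like a parameterized procedure."""
    cur = conn.cursor()
    # '?' is sqlite3's placeholder; psycopg2 and pg8000 use '%s' instead.
    cur.execute("delete from daily_totals where order_date = ?", (run_date,))
    cur.execute(
        "insert into daily_totals (order_date, total) "
        "select order_date, sum(amount) from orders "
        "where order_date = ? group by order_date",
        (run_date,),
    )
    conn.commit()


def demo(run_date):
    """Run the 'procedure' twice against throwaway data and return the result."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        "create table orders (order_date text, amount real);"
        "create table daily_totals (order_date text, total real);"
        "insert into orders values ('2024-01-01', 10.0);"
        "insert into orders values ('2024-01-01', 5.0);"
        "insert into orders values ('2024-01-02', 7.0);"
    )
    refresh_daily_totals(conn, run_date)
    refresh_daily_totals(conn, run_date)  # second run replaces, does not duplicate
    return conn.execute("select order_date, total from daily_totals").fetchall()
```

The scheduler can then pass the date in as a command-line argument, much as a SQL Server Agent job step would pass a parameter.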
Best way to host my python logic
It seems to me you are reading some data from Redshift, creating test and training sets, and finally producing predicted results (records). If so:
Host the script on any of your servers (LAN) and connect to Redshift using boto3. If a large number of rows needs to be transferred over the internet, then an EC2 instance in the same region is an option. Start the EC2 instance on an ad-hoc basis, complete your job, and stop it. That will be cost effective. You can do this using the AWS SDK. I have done this using the .NET SDK; I assume boto3 supports it too.
If your result sets are relatively small, you can save them directly into the target Redshift table
If the result sets are larger, save them as CSV (there are several Python libraries for this) and load the rows into a staging table using the COPY command if you need any intermediate calculation. If not, load them directly into the target table.
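For the larger-result case, the CSV and COPY steps could be sketched as follows. The table name, S3 location and IAM role ARN are placeholders, and the sketch assumes the CSV is uploaded to S3 before the COPY statement is run, since Redshift's COPY reads from S3 rather than from a local file.

```python
import csv
import io


def rows_to_csv(rows, header):
    """Serialize scored rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def build_copy_statement(table, s3_uri, iam_role_arn):
    """Build a Redshift COPY statement that loads a CSV file from S3."""
    return (
        f"copy {table} from '{s3_uri}' "
        f"iam_role '{iam_role_arn}' "
        "format as csv ignoreheader 1"
    )


csv_text = rows_to_csv([(1, 0.91), (2, 0.07)], header=["record_id", "score"])
copy_sql = build_copy_statement(
    "scores_staging",                                # hypothetical table
    "s3://my-bucket/scores/2024-01-01.csv",          # hypothetical location
    "arn:aws:iam::123456789012:role/redshift-copy",  # hypothetical role
)
```

The CSV text can be uploaded with boto3's S3 `put_object` call, after which `copy_sql` is executed on the cluster like any other statement.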
Hope this helps.
Related
I want to make a data lake for myself without using any cloud service. I have a Debian server and I want to create this data lake with the Databricks solution, Delta Lake.
All the samples I have found set up Delta Lake on a cloud service.
How can I do this on my own server?
I may later want to create a cluster for storing data and doing machine learning, and I want to use only Python to create the Delta Lake.
It's a broad question. Delta Lake itself is just a library that allows you to work with data in a specific format. To use it you need a few things:
Compute layer that will read & save Delta Lake data. You can run Apache Spark on the local machine or on the Hadoop or Kubernetes cluster or work with Delta files using Python or Rust libraries (although you may not get all features available). Full list of integrations is available here.
Storage layer to keep your Delta Lake tables - if you use one server, then you can use local file system, but as data size grows then you need to think about distributed filesystem, like, HDFS, MinIO, etc.
Data access layer - how you will access that data. It could be Spark code or something similar, but you may also need to expose the data via JDBC/ODBC - in that case you may need to set up Spark's Thrift server or something like that.
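On a single server, the compute and storage layers above can be as small as a local Spark session writing to the local file system. The sketch below assumes `pyspark` and `delta-spark` have been installed with pip and that Java is available. The application name, table path and sample rows are hypothetical.

```python
# Settings that enable Delta Lake in a Spark session.
DELTA_SPARK_CONF = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def build_spark(app_name="local-delta"):
    """Create a local Spark session with Delta Lake enabled."""
    # Imported here so the module can be loaded without Spark installed.
    from pyspark.sql import SparkSession
    from delta import configure_spark_with_delta_pip

    builder = SparkSession.builder.appName(app_name).master("local[*]")
    for key, value in DELTA_SPARK_CONF.items():
        builder = builder.config(key, value)
    return configure_spark_with_delta_pip(builder).getOrCreate()


def write_and_read(spark, path):
    """Append a tiny dataframe to a Delta table at `path` and read it back."""
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    df.write.format("delta").mode("append").save(path)
    return spark.read.format("delta").load(path)
```

Calling `write_and_read(build_spark(), "/data/lake/events")` would create the table directory on the local disk. Moving to HDFS or MinIO later mostly means changing the path and the storage configuration.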
I am a newbie in ETL. I just managed to extract a lot of information in form of JSONs to GCS. Each JSON file includes identical key-value pairs and now I would like to transform them into dataframes on the basis of certain key values.
The next step would be loading this into a data warehouse like Clickhouse, I guess? I was not able to find any tutorials on this process.
TLDR 1) Is there a way to transform JSON data on GCS in Python without downloading the whole data?
TLDR 2) How can I set this up to run periodically or in real time?
TLDR 3) How can I go about loading the data into a warehouse?
If these are too much, I would love it if you can point me to resources around this. Appreciate the help
There are some ways to do this.
You can add files to storage; a Cloud Function is then triggered every time a new file is added (https://cloud.google.com/functions/docs/calling/storage) and calls an endpoint in Cloud Run (container service - https://cloud.google.com/run/docs/building/containers) running a Python application that transforms these JSONs into a dataframe. Note that the container image will be stored in Container Registry. The Python application running on Cloud Run then saves the rows incrementally to BigQuery (the warehouse). After that you can do analytics with Looker Studio.
If you need to scale the solution to millions/billions of rows, you can add files to storage, have the Cloud Function trigger and call Dataproc, a service where you can run Python, Anaconda, etc. (How to call google dataproc job from google cloud function). The Dataproc cluster will then structure the JSONs as a dataframe and save them to the warehouse (BigQuery).
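A simplified sketch of the first variant is shown below. To keep it short it does the transformation inside the storage-triggered function itself instead of calling Cloud Run, so treat it as an illustration of the flow, not the full architecture. Only the newly added file is read, not the whole bucket. The key names and the BigQuery table ID are hypothetical, and the `google-cloud-storage` and `google-cloud-bigquery` packages are assumed to be installed.

```python
import json


def json_to_rows(json_text, wanted_keys):
    """Turn one JSON document (an object or a list of objects) into flat rows."""
    data = json.loads(json_text)
    records = data if isinstance(data, list) else [data]
    return [{key: record.get(key) for key in wanted_keys} for record in records]


def on_new_file(event, context):
    """Entry point for a storage-triggered function (background-function signature)."""
    # Imported here so json_to_rows can be used and tested without the GCP libraries.
    from google.cloud import bigquery, storage

    blob = storage.Client().bucket(event["bucket"]).blob(event["name"])
    rows = json_to_rows(blob.download_as_text(), wanted_keys=["id", "status"])

    # insert_rows_json returns a list of per-row errors (empty on success).
    errors = bigquery.Client().insert_rows_json("my-project.my_dataset.events", rows)
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```

The rows returned by `json_to_rows` can also be handed to `pandas.DataFrame(rows)` if a dataframe is needed for further processing.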
Scenario:
I have an AWS Glue job which reads from S3 and performs some crawling to insert data from S3 files into Postgres in RDS.
Because the files are sometimes very large, the operation takes a huge amount of time; the job can run for more than 2 days.
Script for job is written in python
I am looking for a way to be able to enhance the job in some ways such as:
Some sort of multi-threading options within the job to perform faster execution - is this feasible? any options/alternative for this?
Is there any hidden or unexplored option of AWS which I can try for this sort of activity?
Any out of the box thoughts?
Any response would be appreciated, thank you!
IIUC, you do not need to crawl the complete data if you only need to dump it into RDS. A crawler is useful if you are going to query that data using Athena or another Glue component, but if you just need to dump the data into RDS you can try the following options.
You can use a Glue Spark job to read all the files and load the data into Postgres over a JDBC connection to your RDS instance.
Or you can use a normal Glue job and the pg8000 library to load the files into Postgres. You can utilize batch loading from this utility.
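For the second option, batch loading could look roughly like this. The helper works with any DB-API connection, so it is a sketch under the assumption that you pass it a pg8000 connection. The function names and the default batch size are illustrative only.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def load_in_batches(conn, insert_sql, rows, batch_size=1000):
    """Insert `rows` with executemany, committing once per batch."""
    cur = conn.cursor()
    total = 0
    for chunk in batched(rows, batch_size):
        cur.executemany(insert_sql, chunk)
        conn.commit()
        total += len(chunk)
    return total
```

With pg8000, the statement would use `%s` placeholders, for example `insert into target (a, b) values (%s, %s)`. Because `rows` can be a generator, a large file can be streamed through without being held in memory all at once.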
We can basically use Databricks as an intermediate step, but I'm stuck on the Python script to replicate data from blob storage to Azure MySQL every 30 seconds. We are using CSV files here. The script needs to store the CSVs with the current timestamp.
There is no ready stream option for mysql in spark/databricks as it is not stream source/sink technology.
In Databricks you can use the writeStream .foreach(...) or .foreachBatch(...) option. With foreachBatch, each micro-batch is handed to your function as an ordinary dataframe, which you can save wherever you like (for example, write it to MySQL).
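A sketch of the foreachBatch approach is given below, assuming PySpark and a MySQL JDBC driver are available on the cluster. The host, database, table and credentials are placeholders, and the `loaded_at` column is one possible reading of the "current timestamp" requirement in the question.

```python
def mysql_jdbc_options(host, database, table, user, password):
    """Build the option dictionary for Spark's JDBC writer."""
    return {
        "url": f"jdbc:mysql://{host}:3306/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "com.mysql.cj.jdbc.Driver",
    }


def write_batch_to_mysql(batch_df, batch_id):
    """Called by Spark for every micro-batch; appends it to MySQL."""
    from pyspark.sql.functions import current_timestamp

    options = mysql_jdbc_options(
        "example.mysql.database.azure.com",  # placeholder host
        "mydb", "csv_rows", "loader", "secret",  # placeholder values
    )
    (batch_df.withColumn("loaded_at", current_timestamp())
        .write.format("jdbc").options(**options).mode("append").save())


def start_stream(spark, source_path, schema, checkpoint_path):
    """Stream new CSV files from `source_path` to MySQL every 30 seconds."""
    return (
        spark.readStream.schema(schema).option("header", "true").csv(source_path)
        .writeStream.foreachBatch(write_batch_to_mysql)
        .option("checkpointLocation", checkpoint_path)
        .trigger(processingTime="30 seconds")
        .start()
    )
```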
Personally, I would go for a simpler solution. In Azure Data Factory it is enough to create two datasets (one MySQL, one blob; it can even be done without them) and use a pipeline with a Copy activity to transfer the data.
I'm kind-of at a crossroads in my application - where I'm using python/django, mysql, and ubuntu 12.04
My application will be accessing other applications online, making indexes of their path structure, and submitting forms. If you think of this happening with 10s or 100s of accounts with 1 or more domain names each, the performance can get a little out of hand.
My initial thinking was to setup an ec2 environment to distribute the load of accessing all of these paths on each domain across many ec2 instances, each running celery/rabbitmq to distribute the processing load across these ec2 instances.
The thing is, I want to store the results of the forms I submit. I read that I would likely need to use a NoSQL db (e.g. hadoop, redis, etc).
My question to you all is:
Is there a different way to use celery/rabbitmq with a SQL-db and what are the advantages/disadvantages?
I can see one problem with having to use nosql : the learning curve .
Secondly: is there some other way to distribute the (processing) load of several python scripts being run at the same time on multiple ec2 environments?
Thank you.
Is there a different way to use celery/rabbitmq with a SQL-db and what
are the advantages/disadvantages? I can see one problem with having to
use nosql : the learning curve
Yes.
If you are talking about storing your Django application/model data, you can use it with any SQL type of database as long as you have the Python bindings for it. Most popular SQL databases have python binding.
If you are referring to storing task results in a specific backend, there's support for multiple databases/protocols, both SQL and NoSQL. I believe there's no specific advantage or disadvantage between storing the results in SQL (MySQL, Postgres) or NoSQL (Mongo, CouchDB), but that's just my personal opinion, and it depends on what type of application you are running. These are some of the examples that you can use for SQL databases (from their docs):
# sqlite (filename)
CELERY_RESULT_BACKEND = 'db+sqlite:///results.sqlite'
# mysql
CELERY_RESULT_BACKEND = 'db+mysql://scott:tiger@localhost/foo'
# postgresql
CELERY_RESULT_BACKEND = 'db+postgresql://scott:tiger@localhost/mydatabase'
# oracle
CELERY_RESULT_BACKEND = 'db+oracle://scott:tiger@127.0.0.1:1521/sidname'
If you are referring to a broker (queuing mechanism), RabbitMQ and Redis are the brokers Celery supports most fully; others, such as Amazon SQS, are also available.
Secondly: is there some other way to distribute the (processing) load
of several python scripts being run at the same time on multiple ec2
environments?
That's exactly what Celery does: you can set up your workers on multiple machines, which can be different EC2 instances. Then all you have to do is point their Celery installations to the same queues/broker in your configs. If you want redundancy in your broker (RabbitMQ and/or Redis), you should look at setting them up in clustered configs.
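Pointing every worker at the same broker can be done with one shared configuration module, sketched here. It uses the lowercase setting names of Celery 4 and later (older releases use uppercase names such as `CELERY_RESULT_BACKEND`, as in the examples above). The host names, credentials and queue name are made up.

```python
# celeryconfig.py: deployed unchanged to every EC2 worker instance.

# All workers consume from the same RabbitMQ broker.
broker_url = "amqp://crawler:secret@broker.internal:5672//"

# Task results go to a SQL database, so no NoSQL store is required.
result_backend = "db+mysql://crawler:secret@db.internal/celery_results"

# Queue that tasks are sent to and workers listen on by default.
task_default_queue = "crawl"

# Number of worker processes per machine; tune to the instance size.
worker_concurrency = 4
```

Each instance would load this with `app.config_from_object("celeryconfig")` in its task module and start a worker with something like `celery -A tasks worker`, where `tasks` is a hypothetical module name.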