I've written an AWS Glue job ETL script in Python, and I'm looking for the proper way to perform conditional writes to the DynamoDB table I'm using as the target.
# Write to DynamoDB
glueContext.write_dynamic_frame_from_options(
    frame=SelectFromCollection_node1665510217343,
    connection_type="dynamodb",
    connection_options={
        "dynamodb.output.tableName": args["OUTPUT_TABLE_NAME"]
    }
)
My script writes to DynamoDB with write_dynamic_frame_from_options. The AWS Glue connection parameter docs make no mention of the ability to customize the write behavior in the connection options.
Is there a clean way to write conditionally without using boto?
You cannot do conditional updates with the EMR DynamoDB connector, which Glue uses. It does a complete overwrite of the data. For conditional writes you would have to use Boto3 and distribute it across the Spark executors using foreachPartition.
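A minimal sketch of that approach, assuming a table whose partition key is named `id` (a hypothetical name) and a condition of "only write if the item does not exist yet". The helper names `put_if_absent` and `write_partition` are made up for illustration; `put_item`, `ConditionExpression` and `ConditionalCheckFailedException` are the Boto3/DynamoDB pieces doing the work.

```python
def put_if_absent(table, item, key_name="id"):
    """Write `item` only if no item with the same key exists yet.

    `table` is expected to behave like a boto3 DynamoDB Table resource.
    Returns True if the item was written, False if the condition failed.
    """
    try:
        table.put_item(
            Item=item,
            ConditionExpression="attribute_not_exists(#k)",
            ExpressionAttributeNames={"#k": key_name},
        )
        return True
    except Exception as exc:
        # botocore's ClientError exposes the error code on exc.response.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code == "ConditionalCheckFailedException":
            return False
        raise


def write_partition(rows, table_name, key_name="id"):
    """Runs on a Spark executor: one DynamoDB resource per partition."""
    import boto3  # imported here so the import happens on the executor

    table = boto3.resource("dynamodb").Table(table_name)
    for row in rows:
        # Assumption: rows are pyspark.sql.Row objects whose values are
        # DynamoDB-compatible (floats must be converted to Decimal first).
        put_if_absent(table, row.asDict(), key_name)
```

In the Glue script this would replace the connector write, for example `frame.toDF().foreachPartition(lambda rows: write_partition(rows, args["OUTPUT_TABLE_NAME"]))`. Each item becomes a separate request, so expect it to be slower than the bulk connector.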
Scenario:
I have an AWS Glue job which deals with S3 and performs some crawling to insert data from S3 files into Postgres in RDS.
Because the files are sometimes very large, the operation takes a huge amount of time; the job can run for more than 2 days.
The script for the job is written in Python.
I am looking for a way to enhance the job, such as:
Some sort of multi-threading option within the job for faster execution: is this feasible? Are there any options or alternatives for this?
Is there any hidden or unexplored AWS option which I can try for this sort of activity?
Any out-of-the-box thoughts?
Any response would be appreciated, thank you!
IIUC, you do not need to crawl the complete data if you just need to dump it into RDS. A crawler is useful if you are going to query that data using Athena or any other Glue component, but if you just need to dump the data into RDS you can try the following options.
You can use a Glue Spark job to read all the files and load the data into Postgres over a JDBC connection to your RDS instance.
Or you can use a normal Glue job and the pg8000 library to load the files into Postgres. You can utilize the batch load from this utility.
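As a rough illustration of the batch-load idea, the sketch below reads a CSV file and inserts it in chunks through any DB-API 2.0 connection (pg8000's `connect()` returns one). The function names, the batch size and the `INSERT` statement are assumptions for illustration, not part of any library.

```python
import csv
from itertools import islice


def batches(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def load_csv(conn, csv_file, insert_sql, batch_size=1000):
    """Insert the rows of an open CSV file in batches; return the row count.

    `conn` is any DB-API 2.0 connection. `insert_sql` must use the
    placeholder style of that driver.
    """
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row
    cursor = conn.cursor()
    total = 0
    for chunk in batches(reader, batch_size):
        cursor.executemany(insert_sql, chunk)
        total += len(chunk)
    conn.commit()
    return total
```

With pg8000 the statement would look something like `INSERT INTO my_table VALUES (%s, %s)` (hypothetical table; check the placeholder style of your pg8000 version). For very large files, Postgres `COPY` is usually faster than batched inserts.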
We can basically use Databricks as an intermediate, but I'm stuck on the Python script to replicate data from Blob Storage to Azure MySQL every 30 seconds. We are using CSV files here. The script needs to store the CSVs with the current timestamp.
There is no ready-made streaming option for MySQL in Spark/Databricks, as MySQL is not a streaming source/sink technology.
In Databricks you can use the writeStream .foreach(...) or .foreachBatch(...) option. With foreachBatch, each micro-batch arrives as a temporary DataFrame which you can save to the place of your choice (so write it to MySQL).
Personally I would go for the simple solution. In Azure Data Factory it is enough to create two datasets (it can even be done without them), one for MySQL and one for the blob, and use a pipeline with a Copy activity to transfer the data.
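A sketch of the foreachBatch route (PySpark spells it `foreachBatch`). The factory `make_mysql_writer` and all connection values are hypothetical; the writer calls (`format("jdbc")`, `option`, `mode`, `save`) are the standard DataFrame writer API, and a MySQL JDBC driver is assumed to be available on the cluster.

```python
def make_mysql_writer(jdbc_url, table, user, password):
    """Build a function suitable for writeStream.foreachBatch(...)."""

    def write_batch(batch_df, batch_id):
        # batch_df is an ordinary (non-streaming) DataFrame holding one
        # micro-batch, so the regular JDBC batch writer can be used.
        (
            batch_df.write.format("jdbc")
            .option("url", jdbc_url)
            .option("dbtable", table)
            .option("user", user)
            .option("password", password)
            .mode("append")
            .save()
        )

    return write_batch


# Usage sketch, assuming `stream_df` is a streaming DataFrame read from
# the blob container:
#
#   writer = make_mysql_writer("jdbc:mysql://host:3306/db", "target", "u", "p")
#   (stream_df.writeStream
#       .foreachBatch(writer)
#       .trigger(processingTime="30 seconds")
#       .start())
```

The 30-second trigger matches the interval in the question; a timestamp column could be added inside `write_batch` before saving if the load time has to be stored.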
I need to export my route table details to CSV. The goal is to load the CSV into a graph DB. What I need is the route table ID, CIDR block, gateway and associated subnet list as a CSV table. I would like to automate this as much as I can, since I have several VPCs that I need to consolidate data for.
1- I cannot find a boto3 query that provides this depth of detail.
2- AWS Config does not go to this depth either.
3- The AWS CLI will get me the detail as nested JSON, but here I lose the ease of automation.
Am I missing a detail here, or does AWS not expose this detail for automation?
In the AWS EC2 CLI, describe-route-tables lets you fetch the route table ID, CIDR block, gateway and associated subnet list. If you use this CLI command, the automation consists of fetching your four key values from the JSON, converting them into an array and writing it to a CSV.
The boto3 equivalent for this uses RouteTable and RouteTableAssociation. This is a good example in Python.
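One way to do the JSON-to-CSV step in Python is sketched below. It flattens the response shape returned by `describe_route_tables` (from the CLI or from `boto3.client("ec2").describe_route_tables()`) into one row per route. The function names and CSV column names are my own choices, and pagination is left out.

```python
import csv

# A route points at exactly one kind of target; check the common ones.
TARGET_KEYS = (
    "GatewayId",
    "NatGatewayId",
    "TransitGatewayId",
    "VpcPeeringConnectionId",
    "NetworkInterfaceId",
    "InstanceId",
)
FIELDS = ["RouteTableId", "CidrBlock", "Gateway", "Subnets"]


def flatten_route_tables(response):
    """Return one dict per route from a describe_route_tables response."""
    rows = []
    for table in response.get("RouteTables", []):
        # Subnets are joined with ';' so the list fits in one CSV cell.
        subnets = ";".join(
            assoc["SubnetId"]
            for assoc in table.get("Associations", [])
            if "SubnetId" in assoc
        )
        for route in table.get("Routes", []):
            cidr = route.get("DestinationCidrBlock") or route.get(
                "DestinationIpv6CidrBlock", ""
            )
            target = next((route[k] for k in TARGET_KEYS if k in route), "")
            rows.append(
                {
                    "RouteTableId": table["RouteTableId"],
                    "CidrBlock": cidr,
                    "Gateway": target,
                    "Subnets": subnets,
                }
            )
    return rows


def write_route_csv(rows, out_file):
    """Write the flattened rows to an open text file as CSV."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

Running this once per VPC (or once per region with a VPC filter) and appending to the same file would give the consolidated table.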
I'm working with a small company currently that stores all of their app data in an AWS Redshift cluster. I have been tasked with doing some data processing and machine learning on the data in that Redshift cluster.
The first task I need to do requires some basic transforming of existing data in that cluster into some new tables based on some fairly simple SQL logic. In an MSSQL environment, I would simply put all the logic into a parameterized stored procedure and schedule it via SQL Server Agent Jobs. However, sprocs don't appear to be a thing in Redshift. How would I go about creating a SQL job and scheduling it to run nightly (for example) in an AWS environment?
The other task I have involves developing a machine learning model (in Python) and scoring records in that Redshift database. What's the best way to host my python logic and do the data processing if the plan is to pull data from that Redshift cluster, score it, and then insert it into a new table on the same cluster? It seems like I could spin up an EC2 instance, host my python scripts on there, do the processing on there as well, and schedule the scripts to run via cron?
I see tons of AWS (and non-AWS) products that look like they might be relevant (AWS Glue/Data Pipeline/EMR), but there are so many that I'm a little overwhelmed. Thanks in advance for the assistance!
ETL
Amazon Redshift does not support stored procedures. Also, I should point out that stored procedures are generally a bad thing because you are putting logic into a storage layer, which makes it very hard to migrate to other solutions in the future. (I know of many Oracle customers who have locked themselves into never being able to change technologies!)
You should run your ETL logic external to Redshift, simply using Redshift as a database. This could be as simple as running a script that uses psql to call Redshift, such as:
`psql <authentication stuff> -c 'insert into z select a, b from x'`
(Use psql v8, upon which Redshift was based.)
Alternatively, you could use more sophisticated ETL tools such as AWS Glue (not currently in every Region) or 3rd-party tools such as Bryte.
Machine Learning
Yes, you could run code on an EC2 instance. If it is small, you could use AWS Lambda (maximum 5 minutes run-time). Many ML users like using Spark on Amazon EMR. It depends upon the technology stack you require.
Amazon CloudWatch Events can schedule Lambda functions, which could then launch EC2 instances that do your processing and then self-terminate.
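A sketch of such a Lambda function, assuming the scheduled event carries an AMI ID and the command to run (the event field names, the default instance type and the helper name are all hypothetical). The self-termination relies on the `InstanceInitiatedShutdownBehavior="terminate"` parameter of `run_instances` plus a `shutdown` at the end of the user-data script.

```python
def build_run_instances_args(image_id, instance_type, job_command):
    """Build keyword arguments for ec2.run_instances for a one-shot job."""
    # The user-data script runs once at first boot. Shutting down from
    # inside the instance terminates it because of the behavior flag below.
    user_data = "#!/bin/bash\n" + job_command + "\nshutdown -h now\n"
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "MinCount": 1,
        "MaxCount": 1,
        "InstanceInitiatedShutdownBehavior": "terminate",
        "UserData": user_data,
    }


def lambda_handler(event, context):
    """Entry point invoked by the schedule; returns the new instance ID."""
    import boto3  # available in the Lambda Python runtime

    ec2 = boto3.client("ec2")
    args = build_run_instances_args(
        event["image_id"],
        event.get("instance_type", "t3.medium"),
        event["job_command"],
    )
    response = ec2.run_instances(**args)
    return response["Instances"][0]["InstanceId"]
```

The Lambda execution role needs permission to call `ec2:RunInstances`, and the instance itself would need an instance profile if the job talks to other AWS services.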
Lots of options, indeed!
There are 2 options for running ETL on Redshift:
1. Create some "create table as" type SQL, which will take your source tables as input and generate your target (transformed) table.
2. Do the transformation outside of the database using an ETL tool, for example EMR or Glue.
Generally, in an MPP environment such as Redshift, the best practice is to push the ETL to the powerful database (i.e. option 1).
Only consider taking the ETL outside of Redshift (option 2) where SQL is not the ideal tool for the transformation, or the transformation is likely to take a huge amount of compute resource.
There is no inbuilt scheduling or orchestration tool. Apache Airflow is a good option if you need something more full featured than cron jobs.
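For the cron route, a small script along these lines could run the "create table as" SQL nightly. It is only a sketch: `run_transform` is a made-up helper that works with any DB-API 2.0 connection (for Redshift, typically a PostgreSQL driver), and the table and column names are placeholders.

```python
def run_transform(conn, statements):
    """Run SQL statements in order, committing only if all succeed."""
    cursor = conn.cursor()
    try:
        for sql in statements:
            cursor.execute(sql)
        conn.commit()
    except Exception:
        # Leave the previous target table in place if anything fails.
        conn.rollback()
        raise
    finally:
        cursor.close()


# Hypothetical transformation: rebuild a summary table from `orders`.
NIGHTLY_STATEMENTS = [
    "DROP TABLE IF EXISTS daily_totals",
    "CREATE TABLE daily_totals AS "
    "SELECT customer_id, SUM(amount) AS total "
    "FROM orders GROUP BY customer_id",
]
```

A crontab entry such as `0 2 * * * python3 /opt/etl/nightly.py` (hypothetical path) would then open the connection and call `run_transform(conn, NIGHTLY_STATEMENTS)`. The same function could later be wrapped in an Airflow task if more orchestration is needed.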
Basic transforming of existing data
It seems you are a Python developer (as you said, you are developing a Python-based ML model), so you can do the transformation by following the steps below:
You can use boto3 (https://aws.amazon.com/sdk-for-python/) to talk with Redshift from any workstation on your LAN (make sure your IP has the proper privileges).
You can write your own Python functions that mimic stored procedures. Inside these functions, you can put / construct your transformation logic.
Alternatively, you can create functions using Python in Redshift as well, which will act like stored procedures. See more here (https://aws.amazon.com/blogs/big-data/introduction-to-python-udfs-in-amazon-redshift/).
Finally, you can use Windows Task Scheduler or a cron job to schedule your Python scripts with parameters, like a SQL Server Agent job does.
Best way to host my python logic
It seems to me you are reading some data from Redshift, then creating test and training sets, and finally getting some predicted results (records). If so:
Host the script on any of your servers (LAN) and connect to Redshift using boto3. If you need to transfer a large number of rows over the internet, then an EC2 instance in the same region is an option. Start the EC2 instance on an ad-hoc basis, complete your job, and stop it. That will be cost effective. You can do it using the AWS framework. I have done this using the .NET framework. I assume boto3 has this support too.
If your result sets are relatively small, you can save them directly into the target Redshift table.
If the result sets are larger, save them to CSV (there are several Python libraries) and upload the rows into a staging table using the COPY command if you need any intermediate calculation. If not, upload them directly into the target table.
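The CSV-plus-COPY path could look roughly like this. Both helpers are illustrative names; the bucket, table and IAM role are placeholders, and the CSV is assumed to be uploaded to S3 (for example with boto3's `upload_file`) before the COPY statement is executed against Redshift.

```python
import csv


def write_scores_csv(rows, out_file, header):
    """Write scored records to an open text file with a header row."""
    writer = csv.writer(out_file)
    writer.writerow(header)
    writer.writerows(rows)


def build_copy_sql(table, s3_uri, iam_role_arn):
    """Build a Redshift COPY statement for a CSV file with one header line.

    Assumption: the arguments come from trusted configuration, since they
    are interpolated directly into the SQL text.
    """
    return (
        f"COPY {table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role_arn}' CSV IGNOREHEADER 1"
    )
```

For example, `build_copy_sql("staging_scores", "s3://my-bucket/scores.csv", role_arn)` would load into a hypothetical staging table, after which an `INSERT INTO ... SELECT` can move the rows into the target table.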
Hope this helps.