Conditional writes to DynamoDB when executing an AWS glue script without Boto?

I've written an AWS Glue ETL job script in Python, and I'm looking for the proper way to perform conditional writes to the DynamoDB table I'm using as the target.
# Write to DynamoDB
glueContext.write_dynamic_frame_from_options(
    frame=SelectFromCollection_node1665510217343,
    connection_type="dynamodb",
    connection_options={
        "dynamodb.output.tableName": args["OUTPUT_TABLE_NAME"]
    }
)
My script writes to DynamoDB with write_dynamic_frame_from_options. The AWS Glue connection parameter docs make no mention of any way to customize the write behavior through the connection options.
Is there a clean way to write conditionally without using Boto?

You cannot do conditional updates with the EMR DynamoDB connector, which is what Glue uses. It does a complete overwrite of the data. For conditional writes you would have to use Boto3 and distribute the calls across the Spark executors using foreachPartition.
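A minimal sketch of that approach, assuming the table's key is a single attribute named `id` (hypothetical) and that the DynamicFrame has been converted to a Spark DataFrame with `toDF()`. The helper names `put_if_absent` and `write_partition` are illustrative and are not part of Glue or Boto3.

```python
def put_if_absent(table, item, key_attr="id"):
    """Write `item` only when no item with the same key exists.

    `table` is expected to behave like a boto3 DynamoDB Table resource.
    Returns True if the item was written, False if the condition failed.
    """
    try:
        table.put_item(
            Item=item,
            ConditionExpression="attribute_not_exists(#k)",
            ExpressionAttributeNames={"#k": key_attr},
        )
        return True
    except Exception as exc:
        # botocore's ClientError exposes the service error code here.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code == "ConditionalCheckFailedException":
            return False
        raise


def write_partition(rows, table_name, key_attr="id"):
    """Runs on each executor: one DynamoDB resource per partition."""
    import boto3  # imported here so it is resolved on the executor

    table = boto3.resource("dynamodb").Table(table_name)
    written = 0
    for row in rows:
        if put_if_absent(table, row.asDict(), key_attr):
            written += 1
    return written


# Usage inside the Glue script (names taken from the question):
# df = SelectFromCollection_node1665510217343.toDF()
# df.foreachPartition(
#     lambda rows: write_partition(rows, args["OUTPUT_TABLE_NAME"])
# )
```

The condition shown here only covers "insert if absent"; other conditions would use a different ConditionExpression. Note that the Boto3 resource API rejects Python floats, so numeric columns may need converting to Decimal before the write.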

Related

Load local files to AWS stage in Snowflake using Python

I am planning to ingest data from my local system into a Snowflake table using an Amazon S3 stage. How can I load the data to the S3 stage using Python? Previously, I loaded the data into Snowflake through Snowflake's internal staging, using the commands below.
put file://<local_file_location> @<creating_stage_snowflake> auto_compress=true
copy into <table_name> from @<creating_stage_snowflake>/<file_name>.gz file_format = (TYPE=CSV FIELD_DELIMITER='~' error_on_column_count_mismatch=false, ENCODING = 'UTF-8')
What is the right approach for loading the data from the local system to Amazon S3 and then copying the staged files into the Snowflake table using Python?
Please share your inputs.

I use the SnowSQL tool from the command line, pointing it at a small SQL script that my code generates. It's not very elegant, but it does the job. You can hide the connection info for SnowSQL in the .snowsql\config file.
This works well for individual programs, but is probably not the best solution for a multi-process web service.
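For a pure-Python route to the S3-based flow the question describes, one option is to upload the file with Boto3 and then run COPY INTO against an external stage. A rough sketch, assuming an external stage already points at the root of the bucket, `s3_client` is a `boto3.client("s3")`, and `cursor` comes from the Snowflake Python connector. The function names and all bucket, stage, and table names are placeholders.

```python
def build_copy_sql(table, stage, file_name):
    """Build a COPY INTO statement mirroring the file format in the question."""
    return (
        f"copy into {table} from @{stage}/{file_name} "
        "file_format = (TYPE=CSV FIELD_DELIMITER='~' "
        "error_on_column_count_mismatch=false ENCODING='UTF-8')"
    )


def load_via_s3(s3_client, cursor, local_path, bucket, key, table, stage):
    """Upload a local file to S3, then copy it from the stage into the table.

    Assumes the stage URL is the bucket root, so the S3 key is also the
    path of the file relative to the stage.
    """
    s3_client.upload_file(local_path, bucket, key)
    sql = build_copy_sql(table, stage, key)
    cursor.execute(sql)
    return sql
```

The table and stage names are interpolated directly into the SQL, so they should come from trusted configuration rather than user input.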

AWS Glue advice needed for scaling or performance evaluation

Scenario:
I have an AWS Glue job that reads from S3 and performs some crawling to insert data from S3 files into Postgres in RDS.
Because the files are sometimes very large, the operation takes a huge amount of time; in some cases the job runs for more than 2 days.
The script for the job is written in Python.
I am looking for ways to improve the job, such as:
Some sort of multi-threading option within the job for faster execution. Is this feasible? Are there any options or alternatives for this?
Is there any hidden or unexplored AWS option that I can try for this sort of activity?
Any out-of-the-box thoughts?
Any response would be appreciated, thank you!

IIUC, you do not need to crawl the complete data if you just need to dump it into RDS. A crawler is useful if you are going to query that data using Athena or another Glue component, but if you only need to dump the data into RDS you can try the following options.
You can use a Glue Spark job to read all the files and load the data into Postgres over a JDBC connection to your RDS instance.
Or you can use a regular Glue job and the pg8000 library to load the files into Postgres. You can make use of batch loading with this library.
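A small sketch of the batch-loading idea, written against the generic Python DB-API so that it is not tied to one driver. The `load_rows` and `batched` helpers are hypothetical; the default `%s` placeholder is assumed to match pg8000's DB-API interface, and the batch size of 1000 is an arbitrary starting point.

```python
from itertools import islice


def batched(rows, size):
    """Yield lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def load_rows(conn, table, columns, rows, batch_size=1000, placeholder="%s"):
    """Insert rows in batches through a DB-API connection; returns row count.

    Each batch is sent with executemany and committed separately, so a
    failure late in a large file does not discard earlier batches.
    """
    marks = ", ".join([placeholder] * len(columns))
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({marks})"
    cur = conn.cursor()
    total = 0
    for chunk in batched(rows, batch_size):
        cur.executemany(sql, chunk)
        conn.commit()
        total += len(chunk)
    return total
```

Because `rows` can be a generator, a large S3 file can be streamed through this function without holding every row in memory at once.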

Snowflake - compare 2 tables and send notification for mismatches

I am looking to set up an alert notification, either from the Snowflake side or the AWS side, using Glue jobs or Lambda functions written in Python or Scala.
I would like to compare 2 tables that hold table names and record counts for the source and the target.
Data is loaded from S3 to Snowflake via an AWS Glue job. After the load, I would like to compare the 2 tables to verify that the source and target record counts match, and send a notification for any mismatches.
Please let me know your inputs on how to achieve this.
Thanks,
Jo

If you are using AWS Glue to load the tables in Snowflake, you can continue using Glue to orchestrate the desired result:
Have Glue load the table.
Have Glue run a stored procedure in Snowflake comparing both tables.
https://snowflakecommunity.force.com/s/article/How-to-Use-AWS-Glue-to-Call-Procedures-in-Snowflake
Have AWS Glue send a notification through SNS.
https://aws.amazon.com/blogs/big-data/build-and-automate-a-serverless-data-lake-using-an-aws-glue-trigger-for-the-data-catalog-and-etl-jobs/
See the chapter "Monitoring and notification with Amazon CloudWatch Events".
If you need SQL for the stored procedure that compares two tables, please feel free to add a new question.
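If the comparison is done in the Glue job's Python code rather than in a stored procedure, the compare-and-notify step could look roughly like the sketch below. It assumes the two count tables have already been read into dicts mapping table name to row count, and that `sns_client` is a `boto3.client("sns")`. The function names and the message format are illustrative.

```python
def find_mismatches(source_counts, target_counts):
    """Return {table: (source, target)} for every table whose counts differ.

    A table missing on one side is reported with None for that side.
    """
    mismatches = {}
    for name in sorted(set(source_counts) | set(target_counts)):
        src = source_counts.get(name)
        tgt = target_counts.get(name)
        if src != tgt:
            mismatches[name] = (src, tgt)
    return mismatches


def notify(sns_client, topic_arn, mismatches):
    """Publish one SNS message listing the mismatches; no-op when empty."""
    if not mismatches:
        return False
    lines = [f"{t}: source={s}, target={g}" for t, (s, g) in mismatches.items()]
    sns_client.publish(
        TopicArn=topic_arn,
        Subject="Row count mismatch",
        Message="\n".join(lines),
    )
    return True
```

Sending a single message per run, rather than one per table, keeps the notification volume predictable when many tables are out of sync.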

AWS Redshift Data Processing

I'm working with a small company currently that stores all of their app data in an AWS Redshift cluster. I have been tasked with doing some data processing and machine learning on the data in that Redshift cluster.
The first task I need to do requires some basic transforming of existing data in that cluster into some new tables based on some fairly simple SQL logic. In an MSSQL environment, I would simply put all the logic into a parameterized stored procedure and schedule it via SQL Server Agent Jobs. However, sprocs don't appear to be a thing in Redshift. How would I go about creating a SQL job and scheduling it to run nightly (for example) in an AWS environment?
The other task I have involves developing a machine learning model (in Python) and scoring records in that Redshift database. What's the best way to host my python logic and do the data processing if the plan is to pull data from that Redshift cluster, score it, and then insert it into a new table on the same cluster? It seems like I could spin up an EC2 instance, host my python scripts on there, do the processing on there as well, and schedule the scripts to run via cron?
I see tons of AWS (and non-AWS) products that look like they might be relevant (AWS Glue/Data Pipeline/EMR), but there's so many that I'm a little overwhelmed. Thanks in advance for the assistance!

ETL
When this answer was written, Amazon Redshift did not support stored procedures (support was added in 2019). Also, I should point out that stored procedures are generally a bad thing because you are putting logic into a storage layer, which makes it very hard to migrate to other solutions in the future. (I know of many Oracle customers who have locked themselves into never being able to change technologies!)
You should run your ETL logic external to Redshift, simply using Redshift as a database. This could be as simple as running a script that uses psql to call Redshift, such as:
`psql <authentication stuff> -c 'insert into z select a, b from x'`
(Use psql v8, upon which Redshift was based.)
Alternatively, you could use more sophisticated ETL tools such as AWS Glue (not currently in every Region) or 3rd-party tools such as Bryte.
Machine Learning
Yes, you could run code on an EC2 instance. If the job is small, you could use AWS Lambda (the maximum run-time was 5 minutes when this was written; it is now 15 minutes). Many ML users like using Spark on Amazon EMR. It depends upon the technology stack you require.
Amazon CloudWatch Events can schedule Lambda functions, which could then launch EC2 instances that could do your processing and then self-Terminate.
Lots of options, indeed!

The 2 options for running ETL on Redshift:
1. Create some "create table as" type SQL, which will take your source tables as input and generate your target (transformed) table.
2. Do the transformation outside of the database using an ETL tool, for example EMR or Glue.
Generally, in an MPP environment such as Redshift, the best practice is to push the ETL to the powerful database (i.e. option 1).
Only consider taking the ETL outside of Redshift (option 2) where SQL is not the ideal tool for the transformation, or the transformation is likely to take a huge amount of compute resource.
There is no inbuilt scheduling or orchestration tool. Apache Airflow is a good option if you need something more full featured than cron jobs.

Basic transforming of existing data
Since you are a Python developer (you mentioned you are developing a Python-based ML model), you can do the transformation with the following steps:
You can use boto3 (https://aws.amazon.com/sdk-for-python/) to talk to Redshift from any workstation on your LAN (make sure your IP has the proper privileges).
You can write your own Python functions that mimic stored procedures. Inside these functions, you can put your transformation logic.
Alternatively, you can create Python functions in Redshift itself that act like stored procedures. See more here (https://aws.amazon.com/blogs/big-data/introduction-to-python-udfs-in-amazon-redshift/).
Finally, you can use Windows Task Scheduler or a cron job to schedule your Python scripts with parameters, as a SQL Server Agent job does.
Best way to host my python logic
It seems to me you are reading some data from Redshift, creating test and training sets, and finally producing predicted results (records). If so:
Host the script on any of your servers (LAN) and connect to Redshift using boto3. If a large number of rows has to be transferred over the internet, an EC2 instance in the same region is an option. Start the EC2 instance on an ad-hoc basis, complete your job, and stop it; this is cost effective. You can do this using the AWS SDK. I have done it using the .NET SDK, and I assume boto3 supports it as well.
If your result sets are relatively small, you can save them directly into the target Redshift table.
If the result sets are larger, save them to CSV (there are several Python libraries for this) and load the rows into a staging table using the COPY command if you need any intermediate calculation. If not, load them directly into the target table.
Hope this helps.
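As a rough illustration of the CSV-then-COPY route for larger result sets, the sketch below writes the scored rows with the standard csv module and builds the matching COPY statement. Redshift's COPY reads from S3 rather than from local disk, so the file is assumed to have been uploaded first. The table name, S3 URI, and IAM role ARN are placeholders, and both function names are hypothetical.

```python
import csv


def write_scores_csv(path, header, rows):
    """Write scored records to a CSV file with a header row; returns row count."""
    count = 0
    with open(path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        for row in rows:
            writer.writerow(row)
            count += 1
    return count


def build_redshift_copy(table, s3_uri, iam_role_arn):
    """Build a Redshift COPY statement for a CSV file with one header line."""
    return (
        f"COPY {table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "FORMAT AS CSV IGNOREHEADER 1;"
    )
```

The generated statement would then be executed over whatever SQL connection the script already holds to the cluster.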

Automation testing for aws lambda functions in python

I have an AWS Lambda function that writes S3 file metadata to DynamoDB for every object created in an S3 bucket; for this I have an event trigger on the bucket. I'm planning to automate testing using Python. Can anyone help with how I can automate testing of this Lambda function, using the unittest package, to check the following?
Verify that the DynamoDB table exists.
Validate whether the bucket for the event trigger exists in S3.
Verify the file count in the S3 bucket against the record count in the DynamoDB table.

This can be done using moto and unittest. What moto will do is add in a stateful mock for AWS - your code can continue calling boto like normal, but calls won't actually be made to AWS. Instead, moto will build up state in memory.
For example, you could:
Activate the mock for DynamoDB
Create a DynamoDB table
Add items to the table
Retrieve items from the table and see they exist
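If you're building functionality for both DynamoDB and S3, you'd use both the mock_s3 and mock_dynamodb2 decorators from moto (mock_dynamodb2 was renamed mock_dynamodb in later releases, and moto 5 replaces the per-service decorators with a single mock_aws).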
I wrote up a tutorial on how to do this (it uses pytest instead of unittest but that should be a minor difference). Check it out: joshuaballoch.github.io/testing-lambda-functions/
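The same create, add, and retrieve flow can be sketched with unittest alone. To keep the example runnable without any AWS libraries, it swaps moto for a small hand-written in-memory stand-in (`FakeTable`, hypothetical); with moto you would instead create a real boto3 table inside the mock and pass that to the handler. The handler `record_s3_objects` and its item shape are assumptions, since the question does not show the Lambda code.

```python
import unittest


def record_s3_objects(event, table):
    """Hypothetical handler body: store bucket, key and size per S3 record."""
    for rec in event["Records"]:
        obj = rec["s3"]["object"]
        table.put_item(Item={
            "bucket": rec["s3"]["bucket"]["name"],
            "key": obj["key"],
            "size": obj["size"],
        })


class FakeTable:
    """In-memory stand-in exposing the two Table methods the test needs."""

    def __init__(self):
        self.items = {}

    def put_item(self, Item):
        self.items[Item["key"]] = Item

    def get_item(self, Key):
        item = self.items.get(Key["key"])
        return {"Item": item} if item is not None else {}


class RecordS3ObjectsTest(unittest.TestCase):
    def test_one_item_per_object(self):
        table = FakeTable()
        event = {"Records": [
            {"s3": {"bucket": {"name": "b"},
                    "object": {"key": "a.txt", "size": 3}}},
            {"s3": {"bucket": {"name": "b"},
                    "object": {"key": "c.txt", "size": 7}}},
        ]}
        record_s3_objects(event, table)
        # Record count in the table matches the object count in the event.
        self.assertEqual(len(table.items), len(event["Records"]))
        self.assertEqual(table.get_item(Key={"key": "a.txt"})["Item"]["size"], 3)
```

Saved as a module, this runs with `python -m unittest`. Passing the table into the handler, rather than creating it inside, is what makes it easy to substitute either the fake or a moto-backed table.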
