I've been learning how to use Apache Airflow over the last couple of months and wanted to see if anybody has any experience with transferring CSV files from S3 to a MySQL database in AWS (RDS), or from my local drive to MySQL.
I managed to send everything to an S3 bucket to store it in the cloud using airflow.hooks.S3_hook, and it works great. I used boto3 to do this.
Now I want to push this file to a MySQL database I created in RDS, but I have no idea how to do it. Do I need to use the MySQL hook, add my credentials there, and then write a Python function?
Also, it doesn't have to be S3 to MySQL; I can also try from my local drive to MySQL if that's easier.
Any help would be amazing!
Airflow has S3ToMySqlOperator which can be imported via:
from airflow.providers.mysql.transfers.s3_to_mysql import S3ToMySqlOperator
Note that you will need to install the MySQL provider.
For Airflow 1.10 series (backport version):
pip install apache-airflow-backport-providers-mysql
For Airflow >=2.0 (regular version currently in Beta):
pip install apache-airflow-providers-mysql
Example usage:
S3ToMySqlOperator(
    s3_source_key='myfile.csv',
    mysql_table='myfile_table',
    mysql_duplicate_key_handling='IGNORE',
    mysql_extra_options="""
        FIELDS TERMINATED BY ','
        IGNORE 1 LINES
    """,
    task_id='transfer_task',
    aws_conn_id='aws_conn',
    mysql_conn_id='mysql_conn',
    dag=dag,
)
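Alternatively, if you would rather take the "MySQL hook plus a Python function" route from the question, a minimal sketch could look like the following. It assumes the Amazon and MySQL provider packages (Airflow 2-style import paths), and the connection IDs, bucket, key and table names are placeholders:

import csv

from airflow.providers.amazon.aws.hooks.s3 import S3Hook
from airflow.providers.mysql.hooks.mysql import MySqlHook

def s3_csv_to_mysql():
    # download_file writes the S3 object to a local temp file and returns its path
    s3 = S3Hook(aws_conn_id='aws_conn')
    local_path = s3.download_file(key='myfile.csv', bucket_name='my-bucket')

    with open(local_path, newline='') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        rows = list(reader)

    # insert_rows (inherited from DbApiHook) issues plain INSERT statements,
    # so it does not rely on LOAD DATA LOCAL INFILE being enabled on the server
    mysql = MySqlHook(mysql_conn_id='mysql_conn')
    mysql.insert_rows(table='myfile_table', rows=rows)

You would then call s3_csv_to_mysql from a PythonOperator in the same DAG.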
Were you able to resolve the MySQLdb._exceptions.OperationalError: (2068, 'LOAD DATA LOCAL INFILE file request rejected due to restrictions on access') issue?
I am working on a process to automatically remove and add databases to Azure. When the database isn't in use, it can be removed from Azure and placed in cheaper S3 storage as a .bacpac.
I am using Microsoft's SqlPackage.exe from a PowerShell script to export these databases from Azure and import them back into Azure. I invoke it via a Python script so that I can use boto3.
The issue I have is with the down direction at step 3. The sequence would be:
Download the Azure SQL DB to a .bacpac (can be achieved with SqlPackage.exe)
Upload this .bacpac to cheaper S3 storage (using boto3 Python SDK)
Delete the Azure SQL Database (it appears the Azure Blob Python SDK can't help me, and SqlPackage.exe does not seem to have a delete function)
Is step 3 impossible to automate with a script? Could a workaround be to use SqlPackage.exe to import a small dummy .bacpac with the same name, overwriting the old, bigger DB?
Thanks.
To remove an Azure SQL Database using PowerShell, you will need to use the Remove-AzSqlDatabase cmdlet.
To remove an Azure SQL Database using the Azure CLI, you will need to use az sql db delete.
If you want to write code in Python to delete the database, you will need to use the Azure SDK for Python.
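For the Python option, a minimal sketch assuming the azure-identity and azure-mgmt-sql (track 2) packages could look like this; the subscription ID, resource group, server and database names are placeholders:

from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient

credential = DefaultAzureCredential()
sql_client = SqlManagementClient(credential, "<subscription-id>")

# begin_delete returns a poller; wait() blocks until the database is deleted
poller = sql_client.databases.begin_delete(
    resource_group_name="my-resource-group",
    server_name="my-sql-server",
    database_name="my-database",
)
poller.wait()

This is the piece you could call from the same Python script that already drives SqlPackage.exe and boto3, so step 3 does not need a workaround.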
I am trying to connect to an AWS RDS SQL Server instance from AWS Lambda using a Python script to query a table. However, I am not seeing any AWS API for this, and when I try "import pyodbc" I see the below error:
Unable to import module 'lambda_function': No module named 'pyodbc'
Connection:
cnxn = pyodbc.connect(
    "Driver={SQL Server};"
    "Server=data-migration-source-instance.asasasas.eu-east-1.rds.amazonaws.com;"
    "Database=sourcedb;"
    "uid=source;pwd=source1234"
)
Any points on how to query RDS SQL Server?
The error you're getting means that the lambda doesn't have the pyodbc module.
You should read up on dependency management in AWS Lambda. There are basically two strategies for including dependencies with your deployment - Lambda Layers or zip with the deployment package.
If you're using the Serverless Framework then Serverless-python-requirements is an excellent package for managing your dependencies and lets you choose your dependency management strategy with minimal changes to your application.
You need to upload the Lambda's dependencies along with the code. If you deploy your Lambda manually (i.e. create a zip file / work right from the console), you will need to attach the pyodbc library. (More information is available here: https://docs.aws.amazon.com/lambda/latest/dg/python-package.html#python-package-dependencies.)
If you're using any other deployment tool (serverless, SAM, chalice), it will be much easier: https://www.serverless.com/plugins/serverless-python-requirements, https://aws.github.io/chalice/topics/packaging.html#rd-party-packages, https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-sam-cli-using-build.html
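Once pyodbc and the ODBC driver libraries it needs are packaged with the function (via a layer or the zip), the handler itself is short. Here is a rough sketch reusing the connection details from the question; the driver name and the table in the query are assumptions that depend on which driver you bundle and on your schema:

import pyodbc

def lambda_handler(event, context):
    # '{SQL Server}' will not exist inside the Lambda Linux runtime; use the
    # name of whichever ODBC driver you actually package with the function
    cnxn = pyodbc.connect(
        "Driver={ODBC Driver 17 for SQL Server};"
        "Server=data-migration-source-instance.asasasas.eu-east-1.rds.amazonaws.com;"
        "Database=sourcedb;"
        "UID=source;PWD=source1234"
    )
    cursor = cnxn.cursor()
    cursor.execute("SELECT TOP 10 * FROM my_table")  # my_table is a placeholder
    rows = cursor.fetchall()
    cnxn.close()
    return {"rowCount": len(rows)}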
Status:
I have created new tables in a PostgreSQL database on Amazon RDS
I have uploaded a CSV file into a bucket on Amazon S3
Via a Lambda function I have connected to the Amazon S3 bucket and Amazon RDS
I can read the CSV file via the following code:
import csv, io, boto3

s3 = boto3.resource('s3')
client = boto3.client('s3', aws_access_key_id=Access_Key, aws_secret_access_key=Secret_Access_Key)

buf = io.BytesIO()
s3.Object('bucketname', 'filename.csv').download_fileobj(buf)
buf.seek(0)

while True:
    line = buf.readlines(1)  # reads roughly one line per call
    if not line:             # stop at end of file
        break
    print(line)
Problem:
I can't import the necessary Python libraries, e.g. psycopg2, openpyxl, etc.
When I tried to import psycopg2:
import psycopg2
I got the error info:
Unable to import module 'myfilemane': No module named 'psycopg2._psycopg'
At first: I did not import the module "psycopg2._psycopg", only "psycopg2", so I don't know where the suffix '_psycopg' comes from.
Secondly, I followed all the steps in the documentation:
https://docs.aws.amazon.com/lambda/latest/dg/lambda-python-how-to-create-deployment-package.html (1. create a directory. 2. Save all of your Python source files (the .py files) at the root level of this directory. 3. Install any libraries using pip at the root level of the directory. 4. Zip the content of the project-dir directory)
And I have also read this documentation:
https://docs.aws.amazon.com/lambda/latest/dg/vpc-rds-deployment-pkg.html
The same applies to other modules or libraries, e.g. openpyxl: I always get "No module named 'OneNameThatIHaveNotImported'".
So does anyone have an idea, or know another way, how one can edit the CSV file on S3 via a Lambda function and import the edited version into the RDS database?
Thanks for the help in advance!
The answer thread that this SO answer references will put you on the right path. Basically, you'd need to create the deployment package on an EC2 instance that matches the Linux image the AWS Lambda runtime runs on. Better yet, you can deploy Lambda functions through the AWS CLI from the same staging EC2 instance where you created your deployment package.
You can also use precompiled Lambda packages if you want an out-of-the-box fix: https://github.com/jkehler/awslambda-psycopg2 or, more generally, https://github.com/Miserlou/lambda-packages
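Once psycopg2 imports cleanly, the end goal itself (read the CSV from S3, edit it, write it into RDS) is a fairly small handler. A sketch under those assumptions; the RDS endpoint, credentials, and the target table and columns are placeholders:

import csv
import io

import boto3
import psycopg2

s3 = boto3.client('s3')

def lambda_handler(event, context):
    obj = s3.get_object(Bucket='bucketname', Key='filename.csv')
    body = io.StringIO(obj['Body'].read().decode('utf-8'))
    reader = csv.reader(body)
    next(reader)  # skip the header row

    conn = psycopg2.connect(
        host='your-rds-endpoint', dbname='yourdb',
        user='youruser', password='yourpassword'
    )
    # the 'with conn' block commits on success and rolls back on error
    with conn, conn.cursor() as cur:
        for row in reader:
            # edit the row here as needed before inserting
            cur.execute(
                "INSERT INTO my_table (col_a, col_b) VALUES (%s, %s)",
                row[:2],
            )
    conn.close()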
I need to run BigQuery from Python, but the Google BigQuery module doesn't exist:
from google.cloud import bigquery
client = bigquery.Client(project='PROJECT_ID')
query = "SELECT...."
dataset = client.dataset('dataset')
table = dataset.table(name='table')
job = client.run_async_query('my-job', query)
job.destination = table
job.write_disposition = 'WRITE_TRUNCATE'
job.begin()
Do you guys know how to do the connection?
Looks like you do not have the bigquery module installed; you can install it with:
pip install --upgrade google-cloud-bigquery
Ref - Installing the client library
As per the documentation, you need to install the client library for BigQuery.
One thing you need to do is set up credentials to connect BigQuery with Python. You will also need to set the environment variable GOOGLE_APPLICATION_CREDENTIALS to point to the location of your credential file.
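If you prefer to point at the key file explicitly in code rather than exporting the environment variable, something like this should work; the path and project ID are placeholders:

from google.cloud import bigquery

# Builds the client straight from a service-account key file instead of
# relying on the GOOGLE_APPLICATION_CREDENTIALS environment variable
client = bigquery.Client.from_service_account_json(
    "/path/to/service-account.json", project="PROJECT_ID"
)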
If your problem is the connection to BigQuery:
client = bigquery.Client() creates the connection using your default credentials. Default credentials can be set in the terminal using gcloud auth login. You can see more on that here: https://cloud.google.com/sdk/gcloud/reference/auth/login
If your problem is installing the library, consider running pip install --upgrade google-cloud-bigquery in the terminal. The library docs can be found here: https://googleapis.dev/python/bigquery/latest/index.html
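One more note on the snippet in the question: run_async_query comes from a very old release of the library. With current google-cloud-bigquery versions, the same write-to-a-destination-table job looks roughly like this (project, dataset and table IDs are placeholders):

from google.cloud import bigquery

client = bigquery.Client(project="PROJECT_ID")

job_config = bigquery.QueryJobConfig(
    destination="PROJECT_ID.dataset.table",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

query_job = client.query("SELECT ...", job_config=job_config)
query_job.result()  # blocks until the job finishes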
I have a database on Heroku I'm trying to copy to my local machine.
I created a backup of the database by doing:
heroku pgbackups:capture
This creates a dump file of the database, which I downloaded by creating a URL link to it:
heroku pgbackups:url b004
But now I have a dump file and I don't really know what to do with it. I tried pg_restore to restore the database, but I don't know where that information went. I basically want to create a .db file out of this dump file. Is that possible?
Ultimately my end goal is to access this database, so if another method of copying the DB is better, I'm fine with that.
Heroku does not allow you to use SQLite files, as it has a read-only file system. But you can use Django to dump the data from Heroku into a JSON file via the dumpdata command, and then import that into your local dev environment.
Because it can be difficult to run commands that generate files on the web server using heroku run, I suggest you instead install django-smuggler, which makes this operation a point-and-click affair in the admin.
First of all, you should install PostgreSQL on your local machine.
PostgreSQL cheat sheet for django beginners
Then, import your dump file with the pg_restore command:
pg_restore -d yournewdb -U yournewuser --role=yournewuser /tmp/b001.dump
That's all; your data is now cloned from your Heroku app.