Unload to S3 with Python using IAM Role credentials

In Redshift, I run the following to unload data from a table into a file in S3:
unload('select * from table')
to 's3://bucket/unload/file_'
iam_role 'arn:aws:iam::<aws-account-id>:role/<role_name>'
I would like to do the same in Python. Any suggestion on how to replicate this? I saw examples using an access key and secret, but that is not an option for me; I need to use role-based credentials on a non-public bucket.

You will need two sets of credentials: IAM credentials (via an IAM Role) that allow Redshift to access the S3 bucket, and Redshift database credentials to execute SQL commands.
Create a Python program that connects to Redshift, in a manner similar to other databases such as SQL Server, and executes your query. This program will need Redshift login credentials (username and password), not IAM credentials.
The IAM credentials for S3 are assigned as a role to Redshift so that Redshift can store the results in S3. This is the iam_role 'arn:aws:iam::<aws-account-id>:role/<role_name>' part of the UNLOAD command in your question.
You do not need boto3 (or boto) to access Redshift, unless you plan to interface with the Redshift API itself (which manages clusters, not the data stored inside them).
Here is an example Python program to access Redshift (credit due to Varun Verma).
There are other examples on the Internet to help you get started.
############ REQUIREMENTS ####################
# sudo apt-get install python3-pip
# sudo apt-get install libpq-dev
# sudo pip3 install psycopg2
# sudo pip3 install sqlalchemy
# sudo pip3 install sqlalchemy-redshift
##############################################
import sqlalchemy as sa
from sqlalchemy.orm import sessionmaker
#>>>>>>>> MAKE CHANGES HERE <<<<<<<<<<<<<
DATABASE = "dbname"
USER = "username"
PASSWORD = "password"
HOST = "host"
PORT = ""
SCHEMA = "public" #default is "public"
####### connection and session creation ##############
connection_string = "redshift+psycopg2://%s:%s@%s:%s/%s" % (USER, PASSWORD, HOST, str(PORT), DATABASE)
engine = sa.create_engine(connection_string)
session = sessionmaker()
session.configure(bind=engine)
s = session()
SetPath = "SET search_path TO %s" % SCHEMA
s.execute(SetPath)
###### All Set Session created using provided schema #######
################ write queries from here ######################
query = "unload('select * from table') to 's3://bucket/unload/file_' iam_role 'arn:aws:iam::<aws-account-id>:role/<role_name>';"
s.execute(query)
s.commit()  # UNLOAD returns no result rows, so commit instead of fetching

# helper to print the rows of a SELECT result, e.g. pretty(s.execute("select ...").fetchall())
def pretty(all_results):
    for row in all_results:
        print("row start >>>>>>>>>>>>>>>>>>>>")
        for r in row:
            print(" ----", r)
        print("row end >>>>>>>>>>>>>>>>>>>>>>")

########## close session in the end ###############
s.close()

Related

Connect to cloudSQL db using service account with pymysql or mysql.connector

I have a CloudSQL instance running in another VPC and an nginx proxy to allow cross-VPC access.
I can access the DB using a built-in user, but how can I access it using a Google Service Account?
import google.auth
import google.auth.transport.requests
import mysql.connector
from mysql.connector import Error
import os

creds, project = google.auth.default()
auth_req = google.auth.transport.requests.Request()
creds.refresh(auth_req)

connection = mysql.connector.connect(host=HOST,
                                     database=DB,
                                     user=SA_USER,
                                     password=creds.token)

if connection.is_connected():
    db_Info = connection.get_server_info()
    print("Connected to MySQL Server version ", db_Info)
    cur = connection.cursor()
    cur.execute("""SELECT now()""")
    query_results = cur.fetchall()
    print(query_results)
When using mysql.connector, I get this error:
DatabaseError: 2059 (HY000): Authentication plugin 'mysql_clear_password' cannot be loaded: plugin not enabled
Then I tried using pymysql
import pymysql
import google.auth
import google.auth.transport.requests
import os

creds, project = google.auth.default()
auth_req = google.auth.transport.requests.Request()
creds.refresh(auth_req)

try:
    conn = pymysql.connect(host=ENDPOINT, user=SA_USER, passwd=creds.token, port=PORT, database=DBNAME)
    cur = conn.cursor()
    cur.execute("""SELECT now()""")
    query_results = cur.fetchall()
    print(query_results)
except Exception as e:
    print("Database connection failed due to {}".format(e))
Database connection failed due to (1045, "Access denied for user 'xx'@'xxx.xxx.xx.xx' (using password: YES)")
I guess these errors are all related to the token.
Can anyone suggest a proper way to get an SA token to access the CloudSQL DB?
PS: Using the Cloud SQL Auth Proxy is not a good option for our architecture.
The error you have mentioned in the description indicates an issue with authentication. To understand exactly what could have caused it, try these things:
Verify the username and corresponding password.
Check the origin of the connection to see if it matches the URL where the user has access privileges.
Check the user's grant privileges in the database.
As you are trying to access the DB using a Google Service Account, you should use the default service account credentials, which include this authorization token for you. Check out the Client libraries and sample code page for more info. Alternatively, if you prefer to manually create the requests, you can use an OAuth 2.0 token; the Authorizing requests page has more information on how to create these. These access tokens are only valid for 60 minutes, after which they expire. An expired token does not disconnect clients, but if that client connection is broken and must reconnect to the instance after more than an hour, a new access token will need to be pulled and provided on that new connection attempt.
For your use case, since you are not interested in the Cloud SQL Auth Proxy, a service account IAM user is the better way to go.
Note that to get an appropriate access token the scope must be set to Cloud SQL Admin API.
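As a rough sketch (the variable names and scope choice here are illustrative, not taken from the question), pulling such a token with Application Default Credentials scoped to the Cloud SQL Admin API could look like this; the token is then supplied as the password for the IAM database user:
import google.auth
import google.auth.transport.requests

# assumption: Application Default Credentials resolve to the service account you connect as
scopes = ["https://www.googleapis.com/auth/sqlservice.admin"]
creds, project = google.auth.default(scopes=scopes)
creds.refresh(google.auth.transport.requests.Request())  # mints a short-lived access token
token = creds.token  # valid for about 60 minutes; use it as the DB password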
It finally works.
I had to enforce an SSL connection.
import pymysql
from google.oauth2 import service_account
import google.auth.transport.requests

scopes = ["https://www.googleapis.com/auth/cloud-platform", "https://www.googleapis.com/auth/sqlservice.admin"]
credentials = service_account.Credentials.from_service_account_file('key.json', scopes=scopes)
auth_req = google.auth.transport.requests.Request()
credentials.refresh(auth_req)

config = {'user': SA_USER,
          'host': ENDPOINT,
          'database': DBNAME,
          'password': credentials.token,
          'ssl_ca': './server-ca.pem',
          'ssl_cert': './client-cert.pem',
          'ssl_key': './client-key.pem'}

try:
    conn = pymysql.connect(**config)
    with conn:
        print("Connected")
        cur = conn.cursor()
        cur.execute("""SELECT now()""")
        query_results = cur.fetchall()
        print(query_results)
except Exception as e:
    print("Database connection failed due to {}".format(e))
I'd recommend using the Cloud SQL Python Connector; it should make your life way easier!
It manages the SSL connection for you (no need for cert files!), takes care of the credentials (it uses Application Default Credentials, which you can easily set to a service account), and allows you to log in with automatic IAM AuthN so that you don't have to pass the credentials token as a password.
Connecting looks like this:
from google.cloud.sql.connector import Connector, IPTypes
import sqlalchemy
import pymysql

# initialize Connector object
connector = Connector(ip_type=IPTypes.PRIVATE, enable_iam_auth=True)

# function to return the database connection
def getconn() -> pymysql.connections.Connection:
    conn: pymysql.connections.Connection = connector.connect(
        "project:region:instance",  # your Cloud SQL instance connection name
        "pymysql",
        user="my-user",
        db="my-db-name"
    )
    return conn

# create connection pool
pool = sqlalchemy.create_engine(
    "mysql+pymysql://",
    creator=getconn,
)

# insert statement
insert_stmt = sqlalchemy.text(
    "INSERT INTO my_table (id, title) VALUES (:id, :title)",
)

# interact with Cloud SQL database using connection pool
with pool.connect() as db_conn:
    # insert into database
    db_conn.execute(insert_stmt, id="book1", title="Book One")
    # query database
    result = db_conn.execute("SELECT * from my_table").fetchall()
    # Do something with the results
    for row in result:
        print(row)
Let me know if you run into any issues! There is also an interactive Cloud SQL Notebook that will walk you through things in more detail, which you can check out.

Grant LOAD from S3 with PyMySQL 1.0.2 not working

I'm using Ansible with community.mysql.mysql_user to automate database user creation on AWS Aurora. So far all the grants have been working fine; however, a new requirement for "LOAD FROM S3", which is specific to MySQL on AWS, does not show up after it is issued.
I've reproduced this with only PyMySQL (see below), which the Ansible module uses, and I get the same result. I do not see any errors on the database, and the rest of the grants show up as expected.
PyMySQL 1.0.2
CPython 3.9.7
docker: python:3.9.7-slim-buster
If anyone can provide a fix, shed some light, or suggest alternatives, please let me know; otherwise I'll keep digging.
import pymysql.cursors

# Connect to the database
connection = pymysql.connect(host='some_aurora_mysql_5.7_host',
                             user='some_user',
                             password='redacted',
                             database='redacted',
                             cursorclass=pymysql.cursors.DictCursor,
                             ssl={
                                 'ssl': {
                                     'activate': True
                                 }
                             })

with connection:
    with connection.cursor() as cursor:
        # Create a new record
        sql = "GRANT SELECT,LOAD FROM S3 ON `some_table`.* TO 'some_user'@'%' "
        cursor.execute(sql)
    connection.commit()

    with connection.cursor() as cursor:
        # Read a single record
        sql = "show grants for some_user"
        cursor.execute(sql)
        result = cursor.fetchone()
        print(result)
As it turns out, LOAD FROM S3 is granted on the whole database server/cluster, not on individual databases.
So:
GRANT LOAD FROM S3 ON *.* TO 'test_user'@'%' works fine.
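For reference, a minimal sketch of issuing the cluster-wide grant with plain PyMySQL could look like this (the endpoint and the admin credentials are placeholders carried over from the snippet above):
import pymysql.cursors

connection = pymysql.connect(host='some_aurora_mysql_5.7_host',
                             user='some_admin_user',
                             password='redacted',
                             cursorclass=pymysql.cursors.DictCursor)
with connection:
    with connection.cursor() as cursor:
        # LOAD FROM S3 is granted at the server/cluster level (*.*), not per database
        cursor.execute("GRANT LOAD FROM S3 ON *.* TO 'some_user'@'%'")
    connection.commit()
    with connection.cursor() as cursor:
        cursor.execute("SHOW GRANTS FOR some_user")
        print(cursor.fetchall())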

How to INSERT INTO Azure SQL database from Azure Databricks in Python

Since pyodbc cannot be installed on Azure Databricks, I am trying to use JDBC to insert data into an Azure SQL database from Python, but I cannot find sample code for that.
jdbcHostname = "xxxxxxx.database.windows.net"
jdbcDatabase = "yyyyyy"
jdbcPort = 1433
#jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2};user={3};password={4}".format(jdbcHostname, jdbcPort, jdbcDatabase, username, password)
jdbcUrl = "jdbc:sqlserver://{0}:{1};database={2}".format(jdbcHostname, jdbcPort, jdbcDatabase)
connectionProperties = {
    "user": jdbcUsername,
    "password": jdbcPassword,
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
pushdown_query = "(INSERT INTO test (a, b) VALUES ('val_a', 'val_b')) insert_test"
Please advise how to write insertion code in Python.
Thanks.
If I may add, you should also be able to use a Spark DataFrame to insert into Azure SQL. Just use the JDBC connection string you get from Azure SQL.
connectionString = "<Azure SQL Connection string>"
data = spark.createDataFrame([(val_a, val_b)], ["a", "b"])
data.write.jdbc(connectionString, "<TableName>", mode="append")
Since pyodbc cannot be installed to Azure databricks
Actually, it seems you could install pyodbc in databricks.
%sh
apt-get -y install unixodbc-dev
/databricks/python/bin/pip install pyodbc
For more details, you could refer to this answer and this blog.
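If you go that route, a minimal pyodbc sketch for the insert could look like the following (the server, database, credentials and table are placeholders from the question, and the Microsoft ODBC driver, e.g. msodbcsql17, also needs to be installed on the cluster):
import pyodbc

# placeholders: substitute your own server, database and credentials
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=xxxxxxx.database.windows.net,1433;"
    "DATABASE=yyyyyy;"
    "UID=<username>;PWD=<password>"
)
cursor = conn.cursor()
cursor.execute("INSERT INTO test (a, b) VALUES (?, ?)", "val_a", "val_b")
conn.commit()
conn.close()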
Piggybacking on Jon's answer ... This is what I used to write data from an Azure Databricks DataFrame to an Azure SQL database:
Hostname = "YOUR_SERVER.database.windows.net"
Database = "YOUR_DB"
Port = 1433
UN = 'YOUR_USERNAME'
PW = 'YOUR_PASSWORD'
Url = "jdbc:sqlserver://{0}:{1};database={2};user={3};password={4}".format(Hostname, Port, Database, UN, PW)

df.write.jdbc(Url, "schema.table", mode="append")

How to connect Amazon Redshift to python

This is my Python code. I want to connect to my Amazon Redshift database from Python, but it is showing an error for the host.
Can anyone tell me the correct syntax? Am I passing all the parameters correctly?
con=psycopg2.connect("dbname = pg_table_def, host=redshifttest-icp.cooqucvshoum.us-west-2.redshift.amazonaws.com, port= 5439, user=me, password= secret")
This is the error:
OperationalError: could not translate host name "redshift://redshifttest-xyz.cooqucvshoum.us-west-2.redshift.amazonaws.com," to address: Unknown host
It appears that you wish to run Amazon Redshift queries from Python code.
The parameters you would want to use are:
dbname: This is the name of the database you entered in the Database name field when the cluster was created.
user: This is the value you entered in the Master user name field when the cluster was created.
password: This is the value you entered in the Master user password field when the cluster was created.
host: This is the Endpoint provided in the Redshift management console (without the port at the end): redshifttest-xyz.cooqucvshoum.us-west-2.redshift.amazonaws.com
port: 5439
For example:
con=psycopg2.connect("dbname=sales host=redshifttest-xyz.cooqucvshoum.us-west-2.redshift.amazonaws.com port=5439 user=master password=secret")
Old question but I just arrived here from Google.
The accepted answer doesn't work with SQLAlchemy, although it's powered by psycopg2:
sqlalchemy.exc.ArgumentError: Could not parse rfc1738 URL from string 'dbname=... host=... port=... user=... password=...'
What worked:
create_engine(f"postgresql://{REDSHIFT_USER}:{REDSHIFT_PASSWORD}@{REDSHIFT_HOST}:{REDSHIFT_PORT}/{REDSHIFT_DATABASE}")
Which works with psycopg2 directly too:
psycopg2.connect(f"postgresql://{REDSHIFT_USER}:{REDSHIFT_PASSWORD}@{REDSHIFT_HOST}:{REDSHIFT_PORT}/{REDSHIFT_DATABASE}")
Using the postgresql dialect works because Amazon Redshift is based on PostgreSQL.
Hope it can help other people!
To connect to Redshift, you need the postgresql+psycopg2 dialect. Install it as follows.
For Python 3.x:
pip3 install psycopg2-binary
And then use
return create_engine(
    "postgresql+psycopg2://%s:%s@%s:%s/%s"
    % (REDSHIFT_USERNAME, urlquote(REDSHIFT_PASSWORD), REDSHIFT_HOST, RED_SHIFT_PORT,
       REDSHIFT_DB,)
)
Well, for Redshift the idea is to COPY from S3, which is faster than any other way, but here is an example of how to do it:
First you must install some dependencies.
For Linux users:
sudo apt-get install libpq-dev
For Mac users:
brew install libpq
Then install these dependencies with pip:
pip3 install psycopg2-binary
pip3 install sqlalchemy
pip3 install sqlalchemy-redshift
import sqlalchemy as sa
from sqlalchemy.orm import sessionmaker

#>>>>>>>> MAKE CHANGES HERE <<<<<<<<<<<<<
DATABASE = "dwtest"
USER = "youruser"
PASSWORD = "yourpassword"
HOST = "dwtest.awsexample.com"
PORT = "5439"
SCHEMA = "public"

S3_FULL_PATH = 's3://yourbucket/category_pipe.txt'
ARN_CREDENTIALS = 'arn:aws:iam::<aws-account-id>:role/<role-name>'
REGION = 'us-east-1'

############ CONNECTING AND CREATING SESSIONS ############
connection_string = "redshift+psycopg2://%s:%s@%s:%s/%s" % (USER, PASSWORD, HOST, str(PORT), DATABASE)
engine = sa.create_engine(connection_string)
session = sessionmaker()
session.configure(bind=engine)
s = session()
SetPath = "SET search_path TO %s" % SCHEMA
s.execute(SetPath)
###########################################################

############ RUNNING COPY ############
copy_command = '''
copy category from '%s'
credentials 'aws_iam_role=%s'
delimiter '|' region '%s';
''' % (S3_FULL_PATH, ARN_CREDENTIALS, REGION)
s.execute(copy_command)
s.commit()
######################################

############ GETTING DATA ############
query = "SELECT * FROM category;"
rr = s.execute(query)
all_results = rr.fetchall()

def pretty(all_results):
    for row in all_results:
        print("row start >>>>>>>>>>>>>>>>>>>>")
        for r in row:
            print(" ---- %s" % r)
        print("row end >>>>>>>>>>>>>>>>>>>>>>")

pretty(all_results)
s.close()
######################################
The easiest way to query AWS Redshift from Python is through this Jupyter extension - Jupyter Redshift
Not only can you query and save your results but also write them back to the database from within the notebook environment.

How to Load Data into Amazon Redshift via Python Boto3?

In Amazon Redshift's Getting Started Guide, data is pulled from Amazon S3 and loaded into an Amazon Redshift Cluster utilizing SQLWorkbench/J. I'd like to mimic the same process of connecting to the cluster and loading sample data into the cluster utilizing Boto3.
However, in Boto3's documentation for Redshift, I'm unable to find a method that would allow me to upload data into an Amazon Redshift cluster.
I've been able to connect with Redshift utilizing Boto3 with the following code:
client = boto3.client('redshift')
But I'm not sure what method would allow me to either create tables or upload data to Amazon Redshift the way it's done in the tutorial with SQLWorkbenchJ.
Right, you need the psycopg2 Python module to execute the COPY command.
My code looks like this:
import psycopg2

#Amazon Redshift connect string
conn_string = "dbname='***' port='5439' user='***' password='***' host='mycluster.***.redshift.amazonaws.com'"
#connect to Redshift (database should be open to the world)
con = psycopg2.connect(conn_string)

sql = """COPY %s FROM '%s' credentials
'aws_access_key_id=%s; aws_secret_access_key=%s'
delimiter '%s' FORMAT CSV %s %s; commit;""" % (
    to_table, fn, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, delim, quote, gzip)
#Here
# fn - s3://path_to__input_file.gz
# gzip = 'gzip'

cur = con.cursor()
cur.execute(sql)
con.close()
I used boto3/psycopg2 to write CSV_Loader_For_Redshift
Go back to step 4 in that tutorial you linked. See where it shows you how to get the URL of the cluster? You have to connect to that URL with a PostgreSQL driver. The AWS SDKs such as Boto3 provide access to the AWS API. You need to connect to Redshift over a PostgreSQL API, just like you would connect to a PostgreSQL database on RDS.
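To make the split concrete, here is a minimal sketch (the cluster identifier, database credentials, S3 path and role ARN are all placeholders): Boto3 only talks to the AWS API to look up the cluster endpoint, while psycopg2 connects over the PostgreSQL protocol and runs the actual COPY.
import boto3
import psycopg2

# AWS API side: boto3 can describe the cluster, but it cannot run SQL
redshift = boto3.client('redshift')
cluster = redshift.describe_clusters(ClusterIdentifier='examplecluster')['Clusters'][0]
endpoint = cluster['Endpoint']['Address']
port = cluster['Endpoint']['Port']

# SQL side: connect with a PostgreSQL driver and load the data with COPY
con = psycopg2.connect(dbname='dev', host=endpoint, port=port, user='master', password='secret')
cur = con.cursor()
cur.execute("COPY mytable FROM 's3://mybucket/data.csv' "
            "iam_role 'arn:aws:iam::<aws-account-id>:role/<role-name>' FORMAT CSV;")
con.commit()
con.close()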
Using psycopg2 & get_cluster_credentials
Prerequisites -
IAM role attached to the respective user
IAM role with the get_cluster_credentials policy
On cloud (EC2), an appropriate IAM role attached to the instance
The code below will work only if you deploy it on a PC/VM where the user's AWS credentials are already configured [ CLI - aws configure ], OR you are on an instance in the same account/VPC.
Have a config.ini file -
[Redshift]
port = 3389
username = please_enter_username
database_name = please_enter_database_name
cluster_id = please_enter_cluster_id_name
url = please_enter_cluster_endpoint_url
region = us-west-2
My Redshift_connection.py
import logging
import psycopg2
import boto3
import configparser

def db_connection():
    logger = logging.getLogger(__name__)
    parser = configparser.ConfigParser()
    parser.read('config.ini')
    RS_PORT = parser.get('Redshift', 'port')
    RS_USER = parser.get('Redshift', 'username')
    DATABASE = parser.get('Redshift', 'database_name')
    CLUSTER_ID = parser.get('Redshift', 'cluster_id')
    RS_HOST = parser.get('Redshift', 'url')
    REGION_NAME = parser.get('Redshift', 'region')

    client = boto3.client('redshift', region_name=REGION_NAME)
    cluster_creds = client.get_cluster_credentials(DbUser=RS_USER,
                                                   DbName=DATABASE,
                                                   ClusterIdentifier=CLUSTER_ID,
                                                   AutoCreate=False)
    try:
        conn = psycopg2.connect(
            host=RS_HOST,
            port=RS_PORT,
            user=cluster_creds['DbUser'],
            password=cluster_creds['DbPassword'],
            database=DATABASE
        )
        return conn
    except psycopg2.Error:
        logger.exception('Failed to open database connection.')
        print("Failed")
Query Execution script -
from Redshift_Connection import db_connection

def executescript(redshift_cursor):
    query = "SELECT * FROM <SCHEMA_NAME>.<TABLENAME>"
    cur = redshift_cursor
    cur.execute(query)

conn = db_connection()
conn.set_session(autocommit=False)
cursor = conn.cursor()
executescript(cursor)
conn.close()
