AWS Glue Data moving from S3 to Redshift - python

I have around 70 tables in one S3 bucket and I would like to move them to Redshift using Glue. I could move only a few tables; the rest have data type issues, because Redshift does not accept some of the data types. I resolved the issue with a script that moves the tables one by one:
table1 = glueContext.create_dynamic_frame.from_catalog(
    database="db1_g", table_name="table1"
)
table1 = table1.resolveChoice(
    specs=[
        ("column1", "cast:char"),
        ("column2", "cast:varchar"),
        ("column3", "cast:varchar"),
    ]
)
table1 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=table1,
    catalog_connection="redshift",
    connection_options={"dbtable": "schema1.table1", "database": "db1"},
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="table1",
)
The same script is used for all the other tables that have the data type issue.
But since I would like to automate this, I wrote a script that loops through all the tables and writes them to Redshift. I have two issues with this script:
1. I am unable to move the tables to their respective schemas in Redshift.
2. I am unable to add an if condition in the loop for the tables that need a data type change.
client = boto3.client("glue", region_name="us-east-1")
databaseName = "db1_g"
Tables = client.get_tables(DatabaseName=databaseName)
tableList = Tables["TableList"]

for table in tableList:
    tableName = table["Name"]
    datasource0 = glueContext.create_dynamic_frame.from_catalog(
        database="db1_g", table_name=tableName, transformation_ctx="datasource0"
    )
    datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=datasource0,
        catalog_connection="redshift",
        connection_options={
            "dbtable": tableName,
            "database": "schema1.db1",
        },
        redshift_tmp_dir=args["TempDir"],
        transformation_ctx="datasink4",
    )
job.commit()
Mentioning the Redshift schema name along with the table name, like schema1.tableName, throws an error saying schema1 is not defined.
Can anybody help with changing the data type for all the tables that need it, inside the looping script itself?

The first problem is fixed rather easily: the schema belongs in the dbtable attribute, not in database, like this:
connection_options={
    "dbtable": f"schema1.{tableName}",
    "database": "db1",
}
Your second problem is that you want to call resolveChoice inside the for loop, correct? What kind of error occurs there? Why doesn't it work?
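In case it helps, here is a minimal sketch of how a conditional resolveChoice could look inside the loop, assuming you maintain a dict (called cast_specs here, a placeholder name) that maps the affected tables to their cast specs; the example specs and the schema1 prefix are placeholders too:

# Hypothetical sketch: cast_specs, the table names and the column specs are placeholders.
cast_specs = {
    "table1": [("column1", "cast:char"), ("column2", "cast:varchar")],
    # add an entry for every table that needs a data type change
}

for table in tableList:
    tableName = table["Name"]
    frame = glueContext.create_dynamic_frame.from_catalog(
        database="db1_g", table_name=tableName, transformation_ctx=tableName
    )
    # apply the casts only for the tables listed in cast_specs
    if tableName in cast_specs:
        frame = frame.resolveChoice(specs=cast_specs[tableName])
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=frame,
        catalog_connection="redshift",
        connection_options={"dbtable": f"schema1.{tableName}", "database": "db1"},
        redshift_tmp_dir=args["TempDir"],
        transformation_ctx=f"sink_{tableName}",
    )
job.commit()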

Related

Create Query string for AWS Timestream ScheduledQuery using Python CDK

I created the database and table like this:
# Amazon Timestream
self.database = timestream.CfnDatabase(scope=self, id="MyDatabase")
self.table = timestream.CfnTable(
    scope=self,
    id="MyTable",
    database_name=self.database.ref,
)
But I cannot get a valid query string, because I cannot get the correct table name.
query = "".join(
    (
        "SELECT * ",
        f'FROM "DATABASE.TABLE" ',
    )
)
If I use self.database.ref it gives me the correct database name. So far so good.
But how do I get the correct table name?
What I tried so far:
1. Giving the table an explicit name.
2. Using self.table.ref, but it gives me "DATABASE|TABLE".
3. Building the quoted name by replacing the pipe in that ref with '"."', which did not work either.
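One possible direction, sketched under the assumption that the Timestream table resource exposes a Name attribute and that the Fn intrinsics are available in your CDK version (the imports below are v1-style and should be verified):

# Hedged sketch, not from the post; verify attr_name and the import path for your CDK version.
from aws_cdk import core

# Option 1: the table's Name attribute (Fn::GetAtt on AWS::Timestream::Table)
table_name = self.table.attr_name

# Option 2: split the "DATABASE|TABLE" ref and take the second part
table_name = core.Fn.select(1, core.Fn.split("|", self.table.ref))

query = f'SELECT * FROM "{self.database.ref}"."{table_name}"'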

how to fetch multiple tables using spark sql

I am fetching data from MySQL using PySpark, but only for one table. I want to fetch all tables from the MySQL database without opening the JDBC connection again and again. See the code below.
Is it possible to simplify my code? Thank you in advance
url = "jdbc:mysql://localhost:3306/dbname"
table_df=sqlContext.read.format("jdbc").option("url",url).option("dbtable","table_name").option("user","root").option("password", "root").load()
sqlContext.registerDataFrameAsTable(table_df, "table1")
table_df_1=sqlContext.read.format("jdbc").option("url",url).option("dbtable","table_name_1").option("user","root").option("password", "root").load()
sqlContext.registerDataFrameAsTable(table_df_1, "table2")
You need to somehow acquire the list of the tables you have in MySQL. Either you find a SQL command to do that, or you manually create a file containing them all.
Then, assuming you can build a Python list of table names, tablename_list, you can simply loop over it like this:
url = "jdbc:mysql://localhost:3306/dbname"
reader = (
    sqlContext.read.format("jdbc")
    .option("url", url)
    .option("user", "root")
    .option("password", "root")
)
for tablename in tablename_list:
    reader.option("dbtable", tablename).load().createTempView(tablename)
This will create a temporary view with the same table name. If you want another name, you can change the initial tablename_list to a list of tuples (tablename_in_mysql, tablename_in_spark), as sketched below.
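For illustration (not part of the original answer), the tuple variant might look like this; tablename_tuples and the example names are placeholders:

# Hypothetical variant: rename on registration using (mysql_name, spark_name) tuples.
tablename_tuples = [("table_name", "table1"), ("table_name_1", "table2")]
for mysql_name, spark_name in tablename_tuples:
    reader.option("dbtable", mysql_name).load().createTempView(spark_name)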
@Steven already gave a perfect answer. As he said, in order to build a Python list of table names, you can use:
# list of the tables in the server
table_names_list = spark.read.format('jdbc'). \
    options(
        url='jdbc:postgresql://localhost:5432/',  # database url (local, remote)
        dbtable='information_schema.tables',
        user='YOUR_USERNAME',
        password='YOUR_PASSWORD',
        driver='org.postgresql.Driver'). \
    load(). \
    filter("table_schema = 'public'").select("table_name")

# DataFrame[table_name: string]
# table_names_list.collect()
# [Row(table_name='employee'), Row(table_name='bonus')]
table_names_list = [row.table_name for row in table_names_list.collect()]
print(table_names_list)
# ['employee', 'bonus']
Note that this is for PostgreSQL. You can easily change the url and driver arguments; a MySQL version is sketched below.
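For instance (my addition, not from the answer), a MySQL equivalent might look roughly like this; the driver class and connection details are assumptions to check against your Connector/J version:

# Hedged MySQL variant: driver class, credentials and schema filter are placeholders.
table_names_list = spark.read.format('jdbc'). \
    options(
        url='jdbc:mysql://localhost:3306/dbname',
        dbtable='information_schema.tables',
        user='root',
        password='root',
        driver='com.mysql.cj.jdbc.Driver'). \
    load(). \
    filter("table_schema = 'dbname'").select("table_name")
table_names_list = [row.table_name for row in table_names_list.collect()]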

Auto-schema for time-partitioned tables in BigQuery

I am trying to append data to a time-partitioned table. We can create a time-partitioned table as follows:
# from google.cloud import bigquery
# client = bigquery.Client()
# dataset_ref = client.dataset('my_dataset')

table_ref = dataset_ref.table('my_partitioned_table')
schema = [
    bigquery.SchemaField('name', 'STRING'),
    bigquery.SchemaField('post_abbr', 'STRING'),
    bigquery.SchemaField('date', 'DATE')
]
table = bigquery.Table(table_ref, schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field='date',  # name of column to use for partitioning
    expiration_ms=7776000000)  # 90 days

table = client.create_table(table)
print('Created table {}, partitioned on column {}'.format(
    table.table_id, table.time_partitioning.field))
I was wondering, however, how to do this without pre-defining the schema, as I am looking for a generic way to append new data.
When I remove the schema in the example above, I get an error saying that a time-partitioned table requires a pre-defined schema. However, my files have changed over time, meaning that I cannot and do not want to redefine my schema (I will use Google DataPrep to clean it up afterwards).
How can I solve this?
You can update the schema of a table when you append new data to it. The two supported schema updates are adding new fields and relaxing a required field to nullable. Search for schemaUpdateOptions on this help page.
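To make the idea concrete, here is a hedged sketch (my own, not code from the answer) of an appending load job that allows schema changes; the dataset, table and file names are placeholders:

# Hedged sketch: append to the partitioned table while allowing schema updates.
from google.cloud import bigquery

client = bigquery.Client()
table_ref = client.dataset('my_dataset').table('my_partitioned_table')

job_config = bigquery.LoadJobConfig()
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
job_config.schema_update_options = [
    bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
    bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION,
]
job_config.autodetect = True
job_config.source_format = bigquery.SourceFormat.CSV
job_config.skip_leading_rows = 1  # assuming the CSV has a header row

with open('data.csv', 'rb') as source_file:
    load_job = client.load_table_from_file(source_file, table_ref, job_config=job_config)
load_job.result()  # wait for the load job to finish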

Python script to query database and placing the file in S3 - Design issue

I am writing a Python script to query about 60 database tables based on a current timestamp and store the results as CSV files in an S3 bucket. There are some global variables that I need access to, like engine, AWS credentials, current_time, etc. The file currently has 60 functions, each querying one table and then calling a function that writes to S3.
How do I organize this code better so I won't have to call these 60 functions from the main function?
More importantly, how do I organize this code following OOP? I am very new to this and any help would be greatly appreciated.
This is what my current code looks like:
import (bunch of imports)

engine = create_engine('sqlite:///bookdatabase.db', echo=False)
access_key = 'adasdasdasdasd'
access_id = 'asdasdasd'

def table_name():
    table_name = 'book'
    sql = "select * from book where modified_date < current_date"
    mn = pandas.read_sql(sql, engine)
    # write_to_s3

def another_table_name():
    # .....

# etc. etc.
Functions that do the same thing with only a single variance are a clue that those actions can be combined into a better structure.
In your case, you are doing the same thing (querying a database and updating a bucket); the difference is that you call different databases and read different tables.
So why not create a function like this:
S3_ACCESS_KEY = '....'
S3_ACCESS_ID = '....'

def export_to_s3(db_configuration):
    for db, tables in db_configuration.items():
        engine = create_engine('sqlite:///{}'.format(db), echo=False)
        for table_name in tables:
            sql = "SELECT * FROM {} WHERE modified_date < current_date".format(table_name)
            with engine.connect() as connection:
                for result in connection.execute(sql):
                    pass  # push result to s3

db_table_names = {'bookdatabase.db': ['book'],
                  'another.db': ['fruits', 'planets']}
export_to_s3(db_table_names)
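As one hedged illustration of the "push result to s3" step (not from the original answer), the whole table could be read with pandas, as in the question, and uploaded as CSV with boto3; the bucket name and key pattern are placeholders:

# Hypothetical helper: read one table and upload it to S3 as <table_name>.csv.
import io

import boto3
import pandas

s3 = boto3.client('s3', aws_access_key_id=S3_ACCESS_ID, aws_secret_access_key=S3_ACCESS_KEY)

def push_table_to_s3(engine, table_name, bucket='my-bucket'):
    sql = "SELECT * FROM {} WHERE modified_date < current_date".format(table_name)
    df = pandas.read_sql(sql, engine)
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    s3.put_object(Bucket=bucket, Key='{}.csv'.format(table_name), Body=buffer.getvalue())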

How do I insert my Python dictionary into my SQL Server database table?

I have a dictionary with 3 keys which correspond to field names in a SQL Server table. The values of these keys come from an Excel file, and I store this dictionary in a dataframe which I now need to insert into a SQL table. This can all be seen in the code below:
import pandas as pd
import pymssql

df = []
fp = "file path"
data = pd.read_excel(fp, sheetname="CRM View")

row_date = data.loc[3, ]
row_sita = "ABZPD"
row_event = data.iloc[12, :]
df = pd.DataFrame({'date': row_date,
                   'sita': row_sita,
                   'event': row_event
                   }, index=None)

df = df[4:]
df = df.fillna("")
print(df)
My question is: how do I insert this dictionary into a SQL table now?
Also, as a side note, this code is part of a loop which needs to go through several Excel files one by one, insert the data into the dictionary, then into SQL, then clear the dictionary and start again with the next Excel file.
You could try something like this:
import MySQLdb

# connect
conn = MySQLdb.connect("127.0.0.1", "username", "password", "database_name")
x = conn.cursor()

# write (parameterized to avoid quoting issues)
x.execute("INSERT INTO table_name (row_date, sita, event) VALUES (%s, %s, %s)",
          (row_date, row_sita, row_event))

# close
conn.commit()
conn.close()
You might have to change it a little based on your SQL restrictions, but should give you a good start anyway.
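Since the question is about SQL Server and already imports pymssql, a hedged equivalent with pymssql might look roughly like this; the server, credentials and table name are placeholders:

# Hypothetical pymssql variant: insert every dataframe row with executemany.
import pymssql

conn = pymssql.connect(server="127.0.0.1", user="username", password="password", database="mydb")
cursor = conn.cursor()

# one (date, sita, event) tuple per dataframe row
rows = list(df[['date', 'sita', 'event']].itertuples(index=False, name=None))
cursor.executemany(
    "INSERT INTO my_table (row_date, sita, event) VALUES (%s, %s, %s)",
    rows,
)
conn.commit()
conn.close()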
For the pandas dataframe, you can use the built-in pandas method to_sql to store it in the database. The following is the way to use it:
import urllib  # in Python 3, use urllib.parse.quote_plus instead

import sqlalchemy as sa

params = urllib.quote_plus(
    "DRIVER={};SERVER={};DATABASE={};Trusted_Connection=True;".format(
        "{SQL Server}", "<db_server_url>", "<db_name>"))
conn_str = 'mssql+pyodbc:///?odbc_connect={}'.format(params)
engine = sa.create_engine(conn_str)
df.to_sql(<table_name>, engine, schema=<schema_name>, if_exists="append", index=False)
For this method you will need to install the sqlalchemy package:
pip install sqlalchemy
You will also need to set up the MSSQL DSN on the machine.
