Sink processed stream data into a database using Apache Flink (Python)

Is it possible to sink processed stream data into a database using PyFlink? All the methods I have found for writing processed data are limited to saving it in txt, CSV, or JSON formats, and there seems to be no way to sink data into a database.

You could use SQL DDL within PyFlink to define a JDBC table sink that you can then insert into. That will look something like this:
my_sink_ddl = """
    CREATE TABLE MyUserTable (
        id BIGINT,
        name STRING,
        age INT,
        status BOOLEAN,
        PRIMARY KEY (id) NOT ENFORCED
    ) WITH (
        'connector' = 'jdbc',
        'url' = 'jdbc:mysql://localhost:3306/mydatabase',
        'table-name' = 'users'
    )
"""
t_env.sql_update(my_sink_ddl)
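After registering the sink, you write to it with an INSERT INTO statement. A minimal sketch, assuming a source table named mySourceTable was registered earlier (the name is illustrative), and that the JDBC connector and MySQL driver jars are on the job's classpath:
# mySourceTable is a hypothetical, previously registered table of processed data.
t_env.sql_update(
    "INSERT INTO MyUserTable SELECT id, name, age, status FROM mySourceTable"
)
t_env.execute("sink_to_mysql")  # submit the job
On newer PyFlink versions, execute_sql replaces sql_update and submits INSERT statements directly.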

Related

AWS Glue Data moving from S3 to Redshift

I have around 70 tables in one S3 bucket and I would like to move them to Redshift using Glue. I could only move a few tables; the rest of them have data type issues, since Redshift does not accept some of the data types. I resolved the issue with a piece of code that moves the tables one by one:
table1 = glueContext.create_dynamic_frame.from_catalog(
    database="db1_g", table_name="table1"
)
table1 = table1.resolveChoice(
    specs=[
        ("column1", "cast:char"),
        ("column2", "cast:varchar"),
        ("column3", "cast:varchar"),
    ]
)
table1 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=table1,
    catalog_connection="redshift",
    connection_options={"dbtable": "schema1.table1", "database": "db1"},
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="table1",
)
The same script is used for all the other tables that have the data type issue.
But since I would like to automate this, I used a looping script which iterates through all the tables and writes them to Redshift. I have two issues related to this script:
Unable to move the tables to their respective schemas in Redshift.
Unable to add an if condition in the loop script for those tables which need a data type change.
client = boto3.client("glue", region_name="us-east-1")
databaseName = "db1_g"
Tables = client.get_tables(DatabaseName=databaseName)
tableList = Tables["TableList"]

for table in tableList:
    tableName = table["Name"]
    datasource0 = glueContext.create_dynamic_frame.from_catalog(
        database="db1_g", table_name=tableName, transformation_ctx="datasource0"
    )
    datasink4 = glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=datasource0,
        catalog_connection="redshift",
        connection_options={
            "dbtable": tableName,
            "database": "schema1.db1",
        },
        redshift_tmp_dir=args["TempDir"],
        transformation_ctx="datasink4",
    )
job.commit()
Mentioning the Redshift schema name along with the table name, like this: schema1.tableName, throws an error which says schema1 is not defined.
Can anybody help with changing the data type for all tables which require it, inside the looping script itself?
So the first problem is fixed rather easily. The schema belongs in the dbtable attribute, not in database, like this:
connection_options={
    "dbtable": f"schema1.{tableName}",
    "database": "db1",
}
Your second problem is that you want to call resolveChoice inside the for loop, correct? What kind of error occurs there? Why doesn't it work?
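In the meantime, if the goal is simply to apply resolveChoice only to the tables that need it, a sketch along these lines might work (the cast_specs mapping is hypothetical; fill it with your own table names and specs):
# Hypothetical mapping from table name to the resolveChoice specs it needs.
cast_specs = {
    "table1": [
        ("column1", "cast:char"),
        ("column2", "cast:varchar"),
        ("column3", "cast:varchar"),
    ],
    # add the other problematic tables here
}

for table in tableList:
    tableName = table["Name"]
    frame = glueContext.create_dynamic_frame.from_catalog(
        database="db1_g", table_name=tableName, transformation_ctx="datasource0"
    )
    if tableName in cast_specs:  # only cast where a table requires it
        frame = frame.resolveChoice(specs=cast_specs[tableName])
    glueContext.write_dynamic_frame.from_jdbc_conf(
        frame=frame,
        catalog_connection="redshift",
        connection_options={
            "dbtable": f"schema1.{tableName}",
            "database": "db1",
        },
        redshift_tmp_dir=args["TempDir"],
        transformation_ctx="datasink4",
    )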

DataFrame.to_sql equivalent of using "RETURNING id" with psycopg2?

When inserting data using psycopg2 I can retrieve the id of the inserted row using a RETURNING PostgreSQL statement:
import psycopg2

conn = my_connection_parameters()
curs = conn.cursor()

sql_insert_data_query = (
    """INSERT INTO public.data
       (created_by, comment)
       VALUES ( %(user)s, %(comment)s )
       RETURNING id; -- the id is automatically managed by the database
    """
)
curs.execute(
    sql_insert_data_query,
    {
        "user": 'me',
        "comment": 'my comment'
    }
)
conn.commit()
data_id = curs.fetchone()[0]
and that's great, because I need this id to write other data to, e.g., an associative table.
But when there's a large dictionary to write to PostgreSQL (whose keys are the column identifiers), it's more convenient to rely on pandas' DataFrame.to_sql() method:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql+psycopg2://', creator=my_connection_parameters)
df = pd.DataFrame(my_dict, index=[0])  # a "one-row" DataFrame, one column per dict key
df.to_sql(
    name='user_table',
    con=engine,
    schema='public',
    if_exists='append',
    index=False
)
but there is no direct way to retrieve the id PostgreSQL created when the record was actually inserted.
Is there a nice and reliable workaround to get it?
Or should I stick with psycopg2 and write my large dictionary using a SQL query?
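One possible middle ground, sketched here rather than offered as a definitive answer: stay with psycopg2, but build the INSERT ... RETURNING statement from the dictionary keys using psycopg2's sql module, so a large dict remains as convenient as with to_sql (user_table and my_dict are taken from the question):
from psycopg2 import sql

# Build "INSERT INTO public.user_table (k1, k2, ...) VALUES (%(k1)s, ...) RETURNING id"
columns = list(my_dict.keys())
insert_query = sql.SQL(
    "INSERT INTO public.user_table ({fields}) VALUES ({placeholders}) RETURNING id"
).format(
    fields=sql.SQL(", ").join(map(sql.Identifier, columns)),
    placeholders=sql.SQL(", ").join(sql.Placeholder(c) for c in columns),
)

conn = my_connection_parameters()
with conn, conn.cursor() as curs:  # commits on successful exit
    curs.execute(insert_query, my_dict)  # values are bound by dict key
    data_id = curs.fetchone()[0]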

Skip forbidden rows from a BigQuery query, using Python

I need to download a relatively small table from BigQuery and store it (after some parsing) in a pandas DataFrame.
Here is the relevant sample of my code:
import pandas as pd
from google.cloud import bigquery

client = bigquery.Client(project="project_id")
job_config = bigquery.QueryJobConfig(allow_large_results=True)
query_job = client.query("my sql string", job_config=job_config)
result = query_job.result()
rows = [dict(row) for row in result]
pdf = pd.DataFrame.from_dict(rows)
My problem:
After a few thousand rows are parsed, one of them is too big and I get an exception: google.api_core.exceptions.Forbidden.
So, after a few iterations, I tried to transform my loop into something that looks like:
rows = list()
for _ in range(result.total_rows):
    try:
        rows.append(dict(next(result)))
    except google.api_core.exceptions.Forbidden:
        pass
BUT it doesn't work, since result is a bigquery.table.RowIterator and, despite its name, it's not an iterator... it's an iterable.
So... what do I do now? Is there a way to either:
ask for the next row in a try/except scope?
tell BigQuery to skip bad rows?
Did you try paging through query results?
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

query = """
    SELECT name, SUM(number) AS total_people
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total_people DESC
"""
query_job = client.query(query)  # Make an API request.
query_job.result()  # Wait for the query to complete.

# Get the destination table for the query results.
#
# All queries write to a destination table. If a destination table is not
# specified, BigQuery populates it with a reference to a temporary
# anonymous table after the query completes.
destination = query_job.destination

# Get the schema (and other properties) for the destination table.
#
# A schema is useful for converting from BigQuery types to Python types.
destination = client.get_table(destination)

# Download rows.
#
# The client library automatically handles pagination.
print("The query data:")
rows = client.list_rows(destination, max_results=20)
for row in rows:
    print("name={}, count={}".format(row["name"], row["total_people"]))
You can also try to filter out big rows in your query:
WHERE LENGTH(some_field) < 123
or
WHERE LENGTH(CAST(some_field AS BYTES)) < 123
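If you still need to skip individual bad rows on the client side, here is a rough sketch, assuming the Forbidden error actually surfaces while the offending row is being fetched: since result is an iterable, iter(result) yields a true iterator whose next() call can live inside a try/except:
import google.api_core.exceptions

rows = []
it = iter(result)  # RowIterator is iterable, so this gives a real iterator
while True:
    try:
        rows.append(dict(next(it)))
    except StopIteration:
        break  # no more rows
    except google.api_core.exceptions.Forbidden:
        pass  # skip the row (or page) that triggered the error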

Auto-schema for time-partitioned tables in BigQuery

I am trying to append data to a time-partitioned table. We can create a time-partitioned table as follows:
# from google.cloud import bigquery
# client = bigquery.Client()
# dataset_ref = client.dataset('my_dataset')

table_ref = dataset_ref.table('my_partitioned_table')
schema = [
    bigquery.SchemaField('name', 'STRING'),
    bigquery.SchemaField('post_abbr', 'STRING'),
    bigquery.SchemaField('date', 'DATE')
]
table = bigquery.Table(table_ref, schema=schema)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field='date',  # name of the column to use for partitioning
    expiration_ms=7776000000)  # 90 days
table = client.create_table(table)

print('Created table {}, partitioned on column {}'.format(
    table.table_id, table.time_partitioning.field))
I was wondering, however, how to do this without pre-defining the schema, as I am looking for a generic way to append new data.
When I remove the schema in the example above, I get an error saying that a time-partitioned table requires a pre-defined schema. However, my files have changed over time, which means I cannot and do not want to keep redefining my schema (I will use Google DataPrep to clean it afterwards).
How can I solve this?
You can update the schema of a table when you append new data to it. The two supported schema updates are adding new fields and relaxing a required field to optional. Search for schemaUpdateOptions in the BigQuery documentation.
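A minimal sketch of what that can look like when appending with a load job (the table and file names are illustrative, and schema auto-detection is assumed to be acceptable for your files):
from google.cloud import bigquery

client = bigquery.Client()
table_ref = client.dataset('my_dataset').table('my_partitioned_table')

job_config = bigquery.LoadJobConfig()
job_config.autodetect = True  # infer the schema from the file
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
job_config.schema_update_options = [
    bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
    bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION,
]

with open('my_file.csv', 'rb') as f:
    load_job = client.load_table_from_file(f, table_ref, job_config=job_config)
load_job.result()  # wait for the append to complete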

Running a entire SQL script via python

I'm looking to run the following test.sql located in a folder on my C: drive. I've been playing with cx_Oracle and just can't get it to work.
test.sql contains the following.
CREATE TABLE MURRAYLR.test
( customer_id    number(10)    NOT NULL,
  customer_name  varchar2(50)  NOT NULL,
  city           varchar2(50)
);

CREATE TABLE MURRAYLR.test2
( customer_id    number(10)    NOT NULL,
  customer_name  varchar2(50)  NOT NULL,
  city           varchar2(50)
);
This is my code:
import sys
import cx_Oracle

connection = cx_Oracle.connect('user', 'password', 'test.ora')
cursor = connection.cursor()

f = open(r"C:\Users\desktop\Test_table.sql")  # raw string, so the backslashes are not treated as escapes
full_sql = f.read()
sql_commands = full_sql.split(';')

for sql_command in sql_commands:
    cursor.execute(sql_command)

cursor.close()
connection.close()
This answer is relevant only if your test.sql file contains newline ('\n') characters (like mine, which I got from copy-pasting your SQL code). If they are present, you will need to remove them in your code. To check, do
print(full_sql)
To fix the '\n's:
sql_commands = full_sql.replace('\n', '').split(';')[:-1]
The above should help. It removes the '\n's, and the [:-1] drops the empty-string token that splitting the SQL string leaves at the end (which would otherwise make cursor.execute('') fail).
MURRAYLR.test is not an acceptable table name in any DBMS I've used. The connection object that cx_Oracle.connect returns should already have a schema selected. To switch to a different schema, set the current_schema field on the connection object, or add using <Schemaname>; in your SQL file.
Obviously, make sure that the schema exists.
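For reference, a minimal sketch of the current_schema approach (assuming the MURRAYLR schema exists; creating objects in another user's schema also requires the appropriate privileges, e.g. CREATE ANY TABLE):
# Switch the session's default schema instead of qualifying every table name.
connection.current_schema = 'MURRAYLR'
cursor.execute("CREATE TABLE test ( customer_id number(10) NOT NULL )")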
