Google BigQuery - Python client - creating/managing jobs

I'm new to the BigQuery world... I'm using the Python google.cloud package and I simply need to run a query from Python on a BigQuery table and print the results. This is the part of the query function that creates the query job.
def test():
    query = "SELECT * FROM " + dataset_name + '.' + table_name
    job = bigquery_client.run_async_query('test-job', query)
    job.begin()
    retry_count = 100
    while retry_count > 0 and job.state != 'DONE':
        retry_count -= 1
        sleep(10)
        job.reload()  # API call
    print(job.state)
    print(job.ended)
If I run the test() function multiple times, I get the error:
google.api.core.exceptions.Conflict: 409 POST https://www.googleapis.com/bigquery/v2/projects/myprocject/jobs:
Already Exists: Job myprocject:test-job
Since I have to run the test() function multiple times, do I have to delete the job named 'test-job' each time or do I have to assign a new job-name (e.g. a random one or datetime-based) each time?

do I have to delete the job named 'test-job' each time
You cannot delete a job. The jobs collection stores your project's complete job history, but availability is only guaranteed for jobs created in the past six months. The best you can do is request automatic deletion of jobs that are more than 50 days old, for which you should contact support.
or do I have to assign a new job-name (e.g. a random one or datetime-based) each time?
Yes, this is the way to go.

As a side recommendation, we usually do it like:
import uuid
job_name = str(uuid.uuid4())
job = bigquery_client.run_async_query(job_name, query)
Notice this is already automatic if you run a synchronous query.
Also, you don't have to manage the polling for job completion yourself (as of version 0.27.0); if you want, you can use it like:
job = bigquery_client.run_async_query(job_name, query)
job_result = job.result()
query_result = job_result.query_results()
data = list(query_result.fetch_data())
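Putting the two together, a test() that is safe to call repeatedly might look like this (just a sketch, reusing the same 0.27-era client calls shown above):
import uuid

def test():
    query = "SELECT * FROM " + dataset_name + '.' + table_name
    job = bigquery_client.run_async_query(str(uuid.uuid4()), query)  # fresh job name each run
    job_result = job.result()                    # waits for the job to finish
    query_result = job_result.query_results()
    for row in query_result.fetch_data():
        print(row)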

Related

Python rq: handle success or failure of jobs

I have a fairly basic (so far) queue set up in my app:
Job 1 (backup): back up the SQL table I'm about to replace
Job 2 (update): do the actual table drop/update
very simplified code:
from rq import Queue
from rq.decorators import job

@job('backup')
def backup(db, table, conn_str):
    backup_sql = "SELECT * INTO {}.dbo.{}_backup FROM {}.dbo.{}".format(db, table, db, table)
    sql_cursor.execute(backup_sql)

@job('update')
def update(db, table, conn_str, keys, data):
    truncate_sql = "TRUNCATE TABLE {}.dbo.{}".format(db, table)
    sql_cursor.execute(truncate_sql)
    for sql_row in data:
        sql = "INSERT INTO {}.dbo.{} ({}) values ({})".format(db, table, ",".join(keys), ",".join(["?"] * len(sql_row)))
        sql_cursor.execute(sql, sql_row)
    sql_cursor.commit()

def update_data():
    ...
    update_queue = Queue('update', connection=redis_conn)
    backup_job = update_queue.enqueue('backup', db, table, conn_str, result_ttl=current_app.config['RESULT_TTL'])
    update_job = update_queue.enqueue('update', db, table, conn_str, result_ttl=current_app.config['RESULT_TTL'])
What I'd like to do is find a way to watch the update job. If it fails, I want to run a job to restore the backup created in the backup job. If it succeeds, I want to run a different job to simply remove the backup.
What's the right way to go about this? I'm pretty new to rq and am looking around in the docs, but haven't found either a way to poll update for success/failure or an idiomatic way to handle either outcome.
One option is to create a third job, called "checker" for example, which will decide what to do based on the status of the "update" job. For that, you have to specify a dependency relationship.
depends_on specifies another job (or job id) that must complete before
this job will be queued.
def checker(*args, **kwargs):
    pass

checker_job = update_queue.enqueue('checker', *args, depends_on=update_job.id, result_ttl=current_app.config['RESULT_TTL'])
Then check the status of the dependency inside "checker" and, based on that status, restore the backup or delete it.
import rq

def checker(*args, **kwargs):
    job = rq.get_current_job()
    update_job = job.dependency
    if update_job.status == 'failed':
        pass  # do the stuff here (restore the backup)
    else:  # or elif
        pass  # do the stuff here (remove the backup)
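To act on that status, the checker can simply enqueue the follow-up work itself. A sketch (restore_backup and remove_backup are hypothetical functions, not from the thread):
from rq import Queue, get_current_job

def checker(db, table, conn_str):
    update_job = get_current_job().dependency
    follow_up = Queue('update', connection=redis_conn)
    if update_job.status == 'failed':
        follow_up.enqueue(restore_backup, db, table, conn_str)   # hypothetical: copy the _backup table back over the original
    else:
        follow_up.enqueue(remove_backup, db, table, conn_str)    # hypothetical: drop the _backup table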

Multiple Python scripts running with their own MSAccess database - conflicts

I have a script that creates an Access database, populates it with data and queries, and performs a compact and repair.
The script gets called via the command line from a .bat file, and I need multiple of these scripts running concurrently.
I'm getting an error where, essentially, it thinks the current database is the database belonging to the other script that is running concurrently.
I think I either need to update the code so that it creates a separate instance for each script (which I think it already does), or update it so that it doesn't need to use the OpenCurrentDatabase() method, but I don't know what alternative I have. I haven't been able to find answers via Google.
if bool(config.query_create_dict):
    logging.info("Creating and Executing Queries")
    try:
        oApp = win32com.client.Dispatch("Access.Application")
        # CREATE QUERIES
        oApp.OpenCurrentDatabase(config.access_db_filepath)
        currentdb = oApp.CurrentDb()
        for order_id, query_dict in config.query_create_dict.items():
            name = query_dict["Name"]
            sql = query_dict["SQL"]
            # replace #valuation_date with the month end
            sql = sql.replace("#valuation_date", "{}".format(config.valuation_date.strftime("%Y-%m-%d")))
            logging.info("Creating query: {}".format(name))
            currentdb.CreateQueryDef(name, sql)
        # EXECUTE QUERIES
        for order_id, name in config.query_execute_dict.items():
            logging.info("Running query: {}".format(name))
            currentdb.Execute(name)
        currentdb = None
        oApp.DoCmd.CloseDatabase()
    except Exception as e:
        logging.error(e)
        raise e
    finally:
        currentdb = None
        oApp.Quit()
        oApp = None
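The thread doesn't include an answer, but one avenue worth trying (my assumption, not from the original post) is win32com's DispatchEx, which asks COM to start a brand-new Access.Application instance instead of attaching to one that is already running:
import win32com.client

# DispatchEx creates a dedicated COM server (its own msaccess.exe) for this script,
# so two concurrently running scripts should no longer share a "current database".
# Assumes the shared-instance behaviour of Dispatch() is what causes the error above.
oApp = win32com.client.DispatchEx("Access.Application")
oApp.OpenCurrentDatabase(config.access_db_filepath)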

Issues with concurrent inserts on Redshift table

I am trying to concurrently process inserts/updates into a Redshift database using a Python script on AWS Glue. I am using the pg8000 library to do all my database operations. The concurrent insert/update fails with an error (Error Name: 1023, Error State: XX000). While researching the error I found out that it was related to serializable isolation.
Can anyone look at the code and ensure that there would not be clashes while the insert/update happens?
I tried using a random sleep time within the calling class. It worked for a couple of cases, but as the number of workers increased it failed for an insert/update case.
import sys
import time
import concurrent.futures
import pg8000
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['TempDir','JOB_NAME','REDSHIFT_HOST','REDSHIFT_PORT','REDSHIFT_DB','REDSHIFT_USER_NAME','REDSHIFT_USER_PASSWORD'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
job_run_id = args['JOB_RUN_ID']
maximum_workers = 5

def executeSql(sqlStmt):
    conn = pg8000.connect(database=args['REDSHIFT_DB'], user=args['REDSHIFT_USER_NAME'], password=args['REDSHIFT_USER_PASSWORD'], host=args['REDSHIFT_HOST'], port=int(args['REDSHIFT_PORT']))
    conn.autocommit = True
    cur = conn.cursor()
    cur.execute(sqlStmt)
    cur.close()
    conn.close()

def executeSqlProcedure(procedureName, procedureArgs=""):
    try:
        logProcStrFormat = "CALL table_insert_proc('{}','{}','{}','{}',{},{})"
        # Insert into the log table - create the record
        executeSql(logProcStrFormat.format(job_run_id, procedureName, 'pending', '', 'getdate()', 'null'))  # Code fails here
        # Executing the procedure
        procStrFormat = "CALL {}({})"
        executeSql(procStrFormat.format(procedureName, procedureArgs))
        print("Printing from {} process at ".format(procedureName), time.ctime())
        # Update the record in log table to complete
        executeSql(logProcStrFormat.format(job_run_id, procedureName, 'complete', '', 'null', 'getdate()'))  # Code fails here
    except Exception as e:
        errorMsg = str(e.message["M"])
        executeSql(logProcStrFormat.format(job_run_id, procedureName, 'failure', errorMsg, 'null', 'getdate()'))
        raise
        sys.exit(1)

def runDims():
    dimProcedures = ["test_proc1","test_proc2","test_proc3","test_proc4","test_proc5"]
    with concurrent.futures.ThreadPoolExecutor(max_workers=maximum_workers) as executor:
        result = list(executor.map(executeSqlProcedure, dimProcedures))

def runFacts():
    factProcedures = ["test_proc6","test_proc7","test_proc8","test_proc9"]
    with concurrent.futures.ThreadPoolExecutor(max_workers=maximum_workers) as executor:
        result = list(executor.map(executeSqlProcedure, factProcedures))

runDims()
runFacts()
I expect the insert/update into the log table to occur without locking or erroring out.
Amazon Redshift does not work well with lots of small INSERT statements.
From Use a Multi-Row Insert - Amazon Redshift:
If a COPY command is not an option and you require SQL inserts, use a multi-row insert whenever possible. Data compression is inefficient when you add data only one row or a few rows at a time.
Multi-row inserts improve performance by batching up a series of inserts. The following example inserts three rows into a four-column table using a single INSERT statement. This is still a small insert, shown simply to illustrate the syntax of a multi-row insert.
insert into category_stage values
(default, default, default, default),
(20, default, 'Country', default),
(21, 'Concerts', 'Rock', default);
Alternatively, output the data to Amazon S3, then perform a bulk load using the COPY command. This will be much more efficient because it can perform the load in parallel across all nodes.
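Applied to the script above, one way to follow that advice (just a sketch; the job_log table and its columns are made up for illustration) is to collect the rows in Python and issue a single multi-row INSERT through pg8000:
def insert_log_rows(conn, rows):
    # rows: list of (job_run_id, procedure_name, status) tuples, written in one statement
    placeholders = ", ".join(["(%s, %s, %s)"] * len(rows))
    sql = "INSERT INTO job_log (job_run_id, procedure_name, status) VALUES " + placeholders
    params = [value for row in rows for value in row]   # flatten for pg8000's %s paramstyle
    cur = conn.cursor()
    cur.execute(sql, params)
    conn.commit()
    cur.close()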

Python 2.7 - MySQL: Query value won't change while running on infinite loop

The idea is to display a message box once the value changes. The COUNT(*) returned by the query is currently 0. While the program runs in an infinite loop checking for changes, a new request is created, so the count should now be 1.
However, this does not happen. The console keeps printing No new Tickets.
Why is the query (row) variable not being updated when there is a change in the background?
If I kill the program and re-run it, it picks up the new value.
while True:
    # if this count increases by 1, send a notification in Python that there is a new ticket
    query = """
        SELECT COUNT(*)  # q.id, q.Name, q.Description, t.Created, t.Subject
        FROM Queues as q
        LEFT JOIN Tickets as t on q.id = t.Queue
        LEFT JOIN Users as u on u.id = t.Owner
        WHERE q.id = 1 AND u.id = 10 AND t.Status = 'New'
          AND STR_TO_DATE(t.Created, "%Y-%m-%d") = CURDATE();
    """
    cursor = conn.cursor()
    cursor.execute(query)
    row = cursor.fetchone()
    # Debug: print the current count
    print row[0]
    # if the current count of unowned, status = 'New' tickets is greater than 0, send a notification
    if row[0] > 0:
        ctypes.windll.user32.MessageBoxW(0, u"New Ticket Available", u"NEW TICKET", 1)
    else:
        print "No new Tickets"
    time.sleep(5)
https://dev.mysql.com/doc/connector-python/en/connector-python-api-mysqlconnection-commit.html
...by default Connector/Python does not autocommit, it is important to call this method after every transaction that modifies data for tables that use transactional storage engines.
Furthermore, if your transaction isolation level is the default (REPEATABLE-READ), then subsequent queries in the same transaction will keep seeing the same snapshot, even if the data has changed in the meantime.
You have two choices:
Call conn.commit() at the start of your loop, so each iteration gets a fresh snapshot of the database.
Set tx_isolation='READ-COMMITTED' before you begin your transaction, so your long-running transaction can see the result of concurrent changes automatically.
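For example, the first option applied to the loop above (only the commit line is new):
while True:
    conn.commit()   # end the previous snapshot so this iteration sees newly created tickets
    cursor = conn.cursor()
    cursor.execute(query)
    row = cursor.fetchone()
    ...             # rest of the loop unchanged
    time.sleep(5)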

Python script to query database and placing the file in S3 - Design issue

I am writing a Python script to query about 60 database tables based on the current timestamp and store them as CSV files in an S3 bucket. There are some global variables that I need access to, like engine, AWS credentials, current_time, etc. The file currently has 60 functions, each querying a table and then calling a function to write to S3.
How do I organize this code better so I won't have to call these 60 functions from the main function?
More importantly, how do I organize this code following OOP? I am very new to this, and any help would be greatly appreciated.
This is what my current code looks like:
import (bunch of imports)

engine = create_engine('sqlite:///bookdatabase.db', echo=False)
access_key = 'adasdasdasdasd'
access_id = 'asdasdasd'

def table_name():
    table_name = 'book'
    sql = "select * from book where modified_date < current_date"
    mn = pandas.read_sql(sql, engine)
    # write_to_s3

def another_table_name():
    # .....
    ...

# etc. etc.
Functions that do the same thing with only a single point of variation are a clue that those actions can be combined into a better structure.
In your case you are doing the same thing (querying a database and updating a bucket); the difference is that you connect to different databases and read different tables.
So why not create a function like this:
from sqlalchemy import create_engine

S3_ACCESS_KEY = '....'
S3_ACCESS_ID = '....'

def export_to_s3(db_configuration):
    for db, tables in db_configuration.items():
        engine = create_engine('sqlite:///{}'.format(db), echo=False)
        for table_name in tables:
            sql = "SELECT * FROM {} WHERE modified_date < current_date".format(table_name)
            with engine.connect() as connection:
                for result in connection.execute(sql):
                    # push result to s3
                    ...

db_table_names = {'bookdatabase.db': ['book'],
                  'another.db': ['fruits', 'planets']}
export_to_s3(db_table_names)
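The "# push result to s3" step is left open above; one way to fill it in (my assumption, reusing pandas as in the original code plus boto3) could be:
import io
import boto3
import pandas

def write_table_to_s3(engine, table_name, bucket):
    # Read the table the same way the original functions did, then upload the CSV bytes.
    sql = "SELECT * FROM {} WHERE modified_date < current_date".format(table_name)
    frame = pandas.read_sql(sql, engine)
    buffer = io.StringIO()
    frame.to_csv(buffer, index=False)
    boto3.client('s3').put_object(Bucket=bucket,
                                  Key='{}.csv'.format(table_name),
                                  Body=buffer.getvalue().encode('utf-8'))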
