Airflow: BigQuery SQL Insert empty data to the table - python

Using Airflow, I am trying to get the data from one table and insert it into another in BigQuery. I have 5 origin tables and 5 destination tables. My SQL query and Python logic work for 4 of the tables, successfully fetching the data and inserting it into their respective destination tables, but they don't work for 1 table.
query = '''SELECT * EXCEPT(eventdate) FROM `gcp_project.gcp_dataset.gcp_table_1`
WHERE id = "1234"
AND eventdate = "2023-01-18"
'''
# Delete the previous destination table if it exists
bigquery_client.delete_table("gcp_project.gcp_dataset.dest_gcp_table_1", not_found_ok=True)

job_config = bigquery.QueryJobConfig()
table_ref = bigquery_client.dataset(gcp_dataset).table(dest_gcp_table_1)
job_config.destination = table_ref
job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE

# Start the query, passing in the extra configuration.
query_job = bigquery_client.query(query=query,
                                  location='US',
                                  job_config=job_config)

# Check that the destination table is successfully written
while not query_job.done():
    time.sleep(1)
logging.info("Data is written into a destination table with {} rows for id {}."
             .format(query_job.result().total_rows, id))
I have even tried the SQL query with CREATE OR REPLACE, but the result was still the same: table_1 comes out empty. I have also tried BigQueryInsertJobOperator, but table_1 still comes out empty.
Note: table_1 holds around 270 MB and 1,463,306 rows; it is also the largest of all the tables being copied to a destination table.
When I execute the same logic from my local machine it works fine for table_1 as well, and I can see the data in GCP BigQuery.
I am not sure what is happening behind the scenes. Does anyone have any idea why this is happening or what could cause it?

Found the root cause for this: the previous query, which is responsible for populating the origin table, was still running in the GCP BigQuery backend. Because of that, the query above did not get any data.
Solution: introduced query_job.result() on that populating job. This waits for the job to complete before the next query is executed.
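A minimal sketch of the fix, assuming populate_query is the upstream statement that fills the origin table (the name is illustrative):
# Run the query that populates the origin table and wait for it to finish;
# result() blocks until the job completes and raises if it failed.
populate_job = bigquery_client.query(populate_query, location='US')
populate_job.result()

# Only now run the query that writes into the destination table.
query_job = bigquery_client.query(query=query, location='US', job_config=job_config)
result = query_job.result()
logging.info("Destination table written with {} rows.".format(result.total_rows))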

Related

How to do SQL Server transactions with Python through SQLAlchemy without blocking access to db tables in SQL Server Management Studio?

I've been stuck for a while now on this question and wasn't able to find the right answer/topic on the internet.
Basically, I have a table on SQL Server that I try to 'replace' with an updated one held in a pandas dataframe. I need to do this task in a transaction so that my original table isn't lost if something goes wrong while transferring data from the dataframe (rollback functionality). I found a solution for this - the SQLAlchemy library.
My code for this:
engine = create_engine("mssql+pyodbc://server_name:password#user/database?driver=SQL+Server+Native+Client+11.0")
with engine.begin() as conn:
df.to_sql(name = 'table_name', schema = 'db_schema', con = conn, if_exists = 'replace', index = False)
The problem occurs when I try to access tables in this database through SQL Server Management Studio 18 while the transaction is running: it somehow blocks the whole database, and no one can access any tables in it (access time limits are exceeded). The code above works, and I have tried transferring only a small chunk of the dataframe, but the problem persists because I ultimately need to transfer a large dataframe.
What I've tried:
The concept of isolation levels, but this isn't the right fit, since isolation levels govern how connections read a table that is already in use.
Example:
engine = create_engine("mssql+pyodbc://server_name:password@user/database?driver=SQL+Server+Native+Client+11.0", isolation_level="SERIALIZABLE")
Adjusting parameters such as pool_size and max_overflow in the create_engine() call and chunksize in the df.to_sql() call, but they don't seem to have any effect. Example:
engine = create_engine("mssql+pyodbc://server_name:password@user/database?driver=SQL+Server+Native+Client+11.0", pool_size=1, max_overflow=0)
with engine.begin() as conn:
    df.to_sql(name='table_name', schema='db_schema', con=conn, if_exists='replace', chunksize=1, index=False)
Excluding the schema parameter from the df.to_sql() call doesn't work either.
Basic SQL code and functionality I'm trying to achieve for this task would look something like this:
BEGIN TRANSACTION
BEGIN TRY
DELETE FROM [db].[schema].[table];
INSERT INTO [db].[schema].[table] <--- dataframe
COMMIT TRANSACTION
END TRY
BEGIN CATCH
ROLLBACK TRANSACTION
SELECT ERROR_NUMBER() AS [Error_number], ERROR_MESSAGE() AS [Error_description]
END CATCH
I could create another buffer table, load the df data into it, and run a transaction from that table afterwards, but I'm looking for a solution that bypasses these steps.
If there is a better way to do this task please let me know as well.
As suggested by @GordThompson, the right solution, given that your db table already exists, is as follows:
engine = create_engine("mssql+pyodbc://server_name:password@user/database?driver=SQL+Server+Native+Client+11.0")
# start transaction
with engine.begin() as conn:
    # clean the table
    conn.exec_driver_sql("TRUNCATE TABLE [db].[schema].[table]")
    # append data from df
    df.to_sql(name='table_name', schema='schema_name', con=conn, if_exists='append', index=False)
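A likely reason this avoids the original blocking: if_exists='append' keeps the existing table definition, whereas if_exists='replace' drops and recreates the table inside the transaction, and the schema locks held for the duration of the long insert are most likely what blocked other sessions browsing the database in SSMS.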

Adding a column from an existing BQ table to another BQ table using Python

I am trying to experiment with creating new tables from existing BQ tables, all within python. So far I've successfully created the table using some similar code, but now I want to add another column to it from another table - which I have not been successful with. I think the problem comes somewhere within my SQL code.
Basically what I want here is to add another column named "ip_address" and put all the info from another table into that column.
I've tried splitting up the two SQL statements and running them separately, and I've tried many different combinations of the commands (taking out CHAR, adding (32) after it, combining everything into one statement, etc.), and I still run into problems.
from google.cloud import bigquery

def alter(client, sql_alter, job_config, table_id):
    query_job = client.query(sql_alter, job_config=job_config)
    query_job.result()
    print(f'Query results appended to table {table_id}')

def main():
    client = bigquery.Client.from_service_account_json('my_json')
    table_id = 'ref.datasetid.tableid'
    job_config = bigquery.QueryJobConfig()
    sql_alter = """
        ALTER TABLE `ref.datasetid.tableid`
        ADD COLUMN ip_address CHAR;
        INSERT INTO `ref.datasetid.tableid` ip_address
        SELECT ip
        FROM `ref.datasetid.table2id`;
    """
    alter(client, sql_alter, job_config, table_id)

if __name__ == '__main__':
    main()
With this code, the current error is "400 Syntax error: Unexpected extra token INSERT at [4:9]". Also, do I have to keep referencing my table as ref.datasetid.tableid, or can I write just tableid? I've run into errors before it gets that far, so I'm still not sure. Still a beginner, so help is greatly appreciated!
BigQuery does not support ALTER TABLE or other DDL statements; take a look at the Modifying table schemas documentation, where you can find an example of how to add a new column when you append data to a table during a load job.
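A minimal sketch of that pattern, here using the query-job variant of the same schema_update_options mechanism since the source data already lives in BigQuery (table names follow the question; note that this appends new rows with ip_address populated rather than filling the column for existing rows, which would require a join key and a MERGE):
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json('my_json')

job_config = bigquery.QueryJobConfig()
job_config.destination = bigquery.TableReference.from_string('ref.datasetid.tableid')
job_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
# Allow the appended query result to introduce the new ip_address column.
job_config.schema_update_options = [bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION]

sql = "SELECT ip AS ip_address FROM `ref.datasetid.table2id`"
client.query(sql, job_config=job_config).result()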

Using SQL Alchemy to Import Data and Replace Given a Condition

Below is the last part of my selenium web scraper that loops through the different tabs of this website page, selects the "export data" button, downloads the data, adds a "yearid" column, then loads the data into a MySQL table.
df = pd.read_csv(desired_filepath)
df["yearid"] = datetime.today().year
df[df.columns[df.columns.str.contains('%')]] = \
    (df.filter(regex='%')
       .apply(lambda x: pd.to_numeric(x.str.replace(r'[\s%]', ''),
                                      errors='coerce')))
df.to_csv(desired_filepath)
engine = create_engine("mysql+pymysql://{user}:{pw}@localhost/{db}"
                       .format(user="walker",
                               pw="password",
                               db="data"))
df.to_sql(con=engine, name='fg_test_hitting_{}'.format(button_text), if_exists='replace')
time.sleep(10)
driver.quit()
Everything works great, but I would like to import the data into the MySQL table and replace rows only where yearid = 2018. Does anyone know if it is possible to load data and replace it given a specific condition? Thanks in advance!
I think rather than deleting from your table, it may be better to just let MySQL handle the replacing. You can do this by creating a temporary table with the new data, running REPLACE INTO the permanent table, and then dropping the temp table. The big caveat here is that you will need to set the keys on your table (ideally only once). I don't know what your key fields are, so it's tough to help in this regard.
Replace the commented line with this:
# df.to_sql(con=engine, name='fg_test_hitting_{}'.format(button_text), if_exists='replace')
conn = engine.connect()
# should fail if temporary table already exists (we want it to fail in this case)
df.to_sql('fg_test_hitting_{}_tmp'.format(button_text), conn)
# Will create the permanent table if it does not already exist (will only matter in the first run)
# note that you may have to create keys here so that mysql knows what constitutes a replacement
conn.execute('CREATE TABLE IF NOT EXISTS fg_test_hitting_{} LIKE fg_test_hitting_{}_tmp;'.format(button_text, button_text))
# updating the permanent table and dropping the temporary table
conn.execute('REPLACE INTO fg_test_hitting_{} (SELECT * FROM fg_test_hitting_{}_tmp);'.format(button_text, button_text))
conn.execute('DROP TABLE IF EXISTS fg_test_hitting_{}_tmp;'.format(button_text))
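As noted above, MySQL needs a key to know what constitutes a replacement. A hypothetical one-time setup, run once after the CREATE TABLE ... LIKE step and before the REPLACE INTO, assuming yearid plus a playerid column identify a row (substitute your real key fields):
# one-time setup: give the permanent table a primary key so REPLACE INTO
# knows which incoming rows overwrite existing ones (yearid/playerid are hypothetical)
conn.execute('ALTER TABLE fg_test_hitting_{} ADD PRIMARY KEY (yearid, playerid);'.format(button_text))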
As described by @Leo in the comments, first delete the part of the data (from the MySQL table) that you are going to update, and then save the dataframe to the MySQL table:
conn = engine.raw_connection()  # DBAPI-level connection so we can use a cursor
cur = conn.cursor()
...
cur.execute('delete from fg_test_hitting_{} where yearid=%s'.format(button_text),
            (datetime.today().year,))
conn.commit()
# append rather than replace so rows for other years are kept
df.to_sql(con=engine, name='fg_test_hitting_{}'.format(button_text), if_exists='append')

BigQuery insert job instead of streaming

I am currently using BigQuery's streaming option to load data into tables. However, tables that have date partitioning enabled do not show any partitions... I am aware that this is an effect of the streaming.
The Python code I use:
def stream_data(dataset_name, table_name, data):
    bigquery_client = bigquery.Client()
    dataset = bigquery_client.dataset(dataset_name)
    table = dataset.table(table_name)
    # Reload the table to get the schema.
    table.reload()
    rows = data
    errors = table.insert_data(rows)
    if not errors:
        print('Loaded 1 row into {}:{}'.format(dataset_name, table_name))
    else:
        print('Errors:')
        print(errors)
Will the date partitions eventually show up, and if not, how can I create an insert job to achieve this?
Not sure what you mean by "partitions not being shown" but when you create a partitioned table you will only see one single table.
The only difference here is that you can query in this table for date partitions, like so:
SELECT
*
FROM
mydataset.partitioned_table
WHERE
_PARTITIONTIME BETWEEN TIMESTAMP('2016-12-25')
AND TIMESTAMP('2016-12-31');
As you can see in this example, partitioned tables have the meta column _PARTITIONTIME and that's what you use to select the partitions you are interested in.
For more info, here are the docs explaining a bit more about querying data in partitioned tables.
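For the insert-job side of the question, here is a minimal sketch of loading rows with a load job instead of streaming, using the current google-cloud-bigquery client (the table id and row payload are illustrative):
from google.cloud import bigquery

client = bigquery.Client()
table_id = 'my_project.my_dataset.partitioned_table'  # illustrative

job_config = bigquery.LoadJobConfig(
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    # day-based partitioning on the destination table
    time_partitioning=bigquery.TimePartitioning(type_=bigquery.TimePartitioningType.DAY),
)

rows = [{"name": "a", "value": 1}]  # illustrative payload
load_job = client.load_table_from_json(rows, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish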

How to update records in SQL Alchemy in a Loop

I am trying to use SQLSoup - the SQLAlchemy extension - to update records in a SQL Server 2008 database. I am using pyodbc for the connections. There are a number of issues which make it hard to find a relevant example.
I am reprojecting a geometry field in a very large table (2 million+ records), so many of the standard ways of updating fields cannot be used. I need to extract the coordinates from the geometry field as text, convert them, and pass them back in. All of this is fine, and all the individual pieces are working.
However, I want to execute a SQL UPDATE statement on each row while looping through the records one by one. I assume this places locks on the record set, or that the connection is still in use, because when I use the code below it hangs after successfully updating the first record.
Any advice on how to create a new connection, reuse the existing one, or accomplish this another way is appreciated.
s = select([text("%s as fid" % id_field),
            text("%s.STAsText() as wkt" % geom_field)],
           from_obj=[feature_table])
rs = s.execute()
for row in rs:
    new_wkt = ReprojectFeature(row.wkt)
    update_value = "geometry :: STGeomFromText('%s',%s)" % (new_wkt, "3785")
    update_sql = ("update %s set GEOM3785 = %s where %s = %i" %
                  (full_name, update_value, id_field, row.fid))
    conn = db.connection()
    conn.execute(update_sql)
    conn.close()  # or not - no effect..
Updated working code now looks like this. It works fine on a few records, but hangs on the whole table, so I guess it is reading in too much data.
db = SqlSoup(conn_string)
# create the outer query
Session = sessionmaker(autoflush=False, bind=db.engine)
session = Session()
rs = session.execute(s)
for row in rs:
    # ...create update sql...
    session.execute(update_sql)
session.commit()
I now get connection busy errors.
DBAPIError: (Error) ('HY000', '[HY000] [Microsoft][ODBC SQL Server Driver]Connection is busy with results for another hstmt (0) (SQLExecDirectW)')
It looks like this could be a problem with the ODBC driver - http://sourceitsoftware.blogspot.com/2008/06/connection-is-busy-with-results-for.html
Further Update:
On the server, the profiler shows the select statement and then the first update statement "starting", but neither completes.
If I set the Select statement to return the top 10 rows, then it does complete and the updates run.
SQL: Batch Starting Select...
SQL: Batch Starting Update...
I believe this is an issue with pyodbc and the SQL Server drivers. If I remove SQLAlchemy and execute the same SQL with pyodbc it also hangs, even if I create a new connection object for the updates.
I also tried the SQL Server Native Client 10.0 driver, which is meant to allow MARS (Multiple Active Result Sets), but it made no difference. In the end I have resorted to "paging the results" and updating these batches using pyodbc and SQL (see below), though I thought SQLAlchemy would have been able to do this for me automatically.
Try using a Session.
rs = s.execute() then becomes rs = session.execute(s), and you can replace the last three lines with session.execute(update_sql). I'd also suggest configuring your Session with autocommit off and calling session.commit() at the end.
Can I suggest that when your process hangs you run sp_who2 on the SQL Server box and see what is happening? Check for blocked spids and see if you can find anything in the SQL code that suggests what is going on. If you do find a spid that is blocking others, you can run dbcc inputbuffer(spid) and see if that tells you what query it executed. Otherwise you can also attach the SQL Profiler and trace your calls.
In some cases it could also be parallelism on the SQL Server that causes the blocking. Unless this is a data warehouse, I suggest turning your max DOP down (set it to 1). Let me know; when I check this again in the morning, I'll be glad to help if you still need it.
Until I find another solution, I am using a single connection and custom SQL to return sets of records and updating these in batches. I don't think what I am doing is a particularly unusual case, so I am not sure why I cannot handle multiple result sets simultaneously.
The code below works, but is very, very slow:
cnxn = pyodbc.connect(conn_string, autocommit=True)
cursor = cnxn.cursor()

# get the total number of records in the table
s = "select count(fid) as count from table"
count = cursor.execute(s).fetchone().count

# choose the number of records to update in each iteration
batch_size = 100
counter = 0

for i in range(1, count, batch_size):
    # sql to bring back the relevant records in each batch
    s = """SELECT fid, wkt from (select ROW_NUMBER() OVER(ORDER BY FID ASC) AS 'RowNumber'
                  ,FID
                  ,GEOM29902.STAsText() as wkt
           FROM %s) features
           where RowNumber >= %i and RowNumber <= %i""" % (full_name, i, i + batch_size)
    rs = cursor.execute(s).fetchall()
    for row in rs:
        new_wkt = ReprojectFeature(row.wkt)
        # ...create the update sql statement for the record
        cursor.execute(update_sql)
        counter += 1

cursor.close()
cnxn.close()
