Snowflake pandas pd_writer writes out tables with NULLs - python

I have a Pandas dataframe that I'm writing out to Snowflake using a SQLAlchemy engine and the to_sql function. It works fine, but I have to use the chunksize option because of a Snowflake limit. This is fine for smaller dataframes. However, some dataframes have 500k+ rows, and at 15k records per chunk it takes forever to finish writing to Snowflake.
I did some research and came across the pd_writer method provided by Snowflake, which apparently loads the dataframe much faster. My Python script does complete faster and I see it creates a table with all the right columns and the right row count, but every single column's value in every single row is NULL.
I thought it was a NaN to NULL issue and tried everything possible to replace the NaNs with None, and while it does the replacement within the dataframe, by the time it gets to the table, everything becomes NULL.
How can I use pd_writer to get these huge dataframes written properly into Snowflake? Are there any viable alternatives?
EDIT: Following Chris' answer, I decided to try with the official example. Here's my code and the result set:
import os

import pandas as pd
from snowflake.sqlalchemy import URL
from sqlalchemy import create_engine
from snowflake.connector.pandas_tools import write_pandas, pd_writer


def create_db_engine(db_name, schema_name):
    return create_engine(
        URL(
            account=os.environ.get("DB_ACCOUNT"),
            user=os.environ.get("DB_USERNAME"),
            password=os.environ.get("DB_PASSWORD"),
            database=db_name,
            schema=schema_name,
            warehouse=os.environ.get("DB_WAREHOUSE"),
            role=os.environ.get("DB_ROLE"),
        )
    )


def create_table(out_df, table_name, idx=False):
    engine = create_db_engine("dummy_db", "dummy_schema")
    connection = engine.connect()
    try:
        out_df.to_sql(
            table_name, connection, if_exists="append", index=idx, method=pd_writer
        )
    except ConnectionError:
        print("Unable to connect to database!")
    finally:
        connection.close()
        engine.dispose()
    return True


df = pd.DataFrame([("Mark", 10), ("Luke", 20)], columns=["name", "balance"])
print(df.head())
create_table(df, "dummy_demo_table")
The code works fine with no hitches, but when I look at the table, which gets created, it's all NULLs. Again.

Turns out, the documentation (arguably, Snowflake's weakest point) is out of sync with reality. This is the real issue: https://github.com/snowflakedb/snowflake-connector-python/issues/329. All it needs is a single character in the column name to be upper case and it works perfectly.
My workaround is to simply do: df.columns = map(str.upper, df.columns) before invoking to_sql.
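The workaround above can be wrapped in a small helper. This is only a sketch: the function name uppercase_columns is made up for illustration, and the collision check is an extra precaution that the original workaround does not include (two columns that differ only by case would end up with the same name once upper-cased).

```python
def uppercase_columns(columns):
    """Return the column names upper-cased, refusing names that would collide."""
    upper = [str(name).upper() for name in columns]
    if len(set(upper)) != len(upper):
        raise ValueError(f"column names collide once upper-cased: {upper}")
    return upper


print(uppercase_columns(["name", "balance"]))  # ['NAME', 'BALANCE']

# With the DataFrame from the question, before calling to_sql:
# df.columns = uppercase_columns(df.columns)
```

The helper works on any iterable of names, so it can be checked without a Snowflake connection.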

I have had this exact same issue. Don't despair, there is a solution in sight. When you create a table in Snowflake, from a Snowflake worksheet or the Snowflake environment, it names the object and all columns and constraints in uppercase. However, when you create the table from Python using the dataframe, the object gets created in the exact case that you specified in your dataframe. In your case that is columns=['name', 'balance']. So when the insert happens, it looks for all-uppercase column names in Snowflake and cannot find them. It still does the insert, but sets your 2 columns to NULL because the columns were created as nullable.
The best way to get past this issue is to create your columns in uppercase in the dataframe: columns=['NAME', 'BALANCE'].
I do think this is something that Snowflake should address and fix, as it is not expected behavior.
Even if you tried to select from your table that has NULLs, you would get an error, e.g.:
select name, balance from dummy_demo_table
You would probably get an error like the following:
SQL compilation error: error line 1 at position 7 invalid identifier 'name'
BUT the following will work:
SELECT * from dummy_demo_table
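To see why case matters here, it may help to model how Snowflake resolves identifiers: unquoted identifiers are folded to uppercase, while double-quoted identifiers keep their exact case. The function below is a simplified sketch of that rule, not Snowflake's actual parser, and resolve_identifier is a hypothetical name.

```python
def resolve_identifier(identifier: str) -> str:
    """Return the name looked up for an identifier as it is written in SQL."""
    if len(identifier) >= 2 and identifier.startswith('"') and identifier.endswith('"'):
        # Quoted: case is preserved; a doubled quote stands for one quote.
        return identifier[1:-1].replace('""', '"')
    # Unquoted: folded to uppercase.
    return identifier.upper()


# A column that was created with a quoted, lowercase name.
stored_column = resolve_identifier('"name"')

print(resolve_identifier("name") == stored_column)               # False
print(resolve_identifier('"name"') == stored_column)             # True
print(resolve_identifier("NAME") == resolve_identifier("name"))  # True
```

Under this model, an unquoted reference such as name can never match a column stored as lowercase name, which is why keeping every column name uppercase sidesteps the mismatch.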

Related

Problems while inserting df values with python into Oracle db

I am having trouble inserting data from a df into an Oracle database table. This is the error: DatabaseError: ORA-01036: illegal variable name/number
These are the steps I did:
This is the dataframe I have imported from yfinance package and elaborated in order to respect the integrity of the data types of my df
I transformed my df into a list, these are my data in the list:
this is the table where I want to insert my data:
This is the code:
sql_insert_temp = "INSERT INTO TEMPO('GIORNO','MESE','ANNO') VALUES(:2,:3,:4)"
index = 0
for i in df.iterrows():
    cursor.execute(sql_insert_temp,df_list[index])
    index += 1
connection.commit()
I have tried a single insert in the SQL Developer worksheet, using the data you can see in the list, and it worked, so I guess I have made a mistake in the code. I have seen other discussions, but I couldn't find a solution to my problem. Do you have any idea how I can solve this, or is there another way to do it?
I have tried to print the iterated queries and that's the result, that's why it's not inserting my data:
If you already have a pandas DataFrame, then you should be able to use the to_sql() method provided by the pandas library.
import cx_Oracle
import sqlalchemy
import pandas as pd

DATABASE = 'DB'
SCHEMA = 'DEV'
PASSWORD = 'password'
connection_string = f'oracle://{SCHEMA}:{PASSWORD}@{DATABASE}'
db_conn = sqlalchemy.create_engine(connection_string)
# create a dataframe with only the columns you want to insert
df_to_insert = df[['GIORNO', 'MESE', 'ANNO']]
# index=False keeps the DataFrame index from being written as an extra column
df_to_insert.to_sql(name='TEMPO', con=db_conn, if_exists='append', index=False)
name is the name of the table.
con is the connection object.
if_exists='append' adds the rows to the end of the table. The other options are 'fail' (the default) and 'replace', which drops and re-creates the table.
Other parameters are described in the pandas documentation for DataFrame.to_sql().
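If you would rather keep cursor-based inserts as in the question, one plausible fix is to leave the column names unquoted (single quotes make them string literals) and to bind exactly as many values as there are placeholders. The sketch below uses the standard-library sqlite3 module purely as a stand-in so that it runs without an Oracle database. The sample rows are made up, and the assumption that each row in df_list carries one extra leading value is a guess based on the :2, :3, :4 placeholders in the question. With cx_Oracle the placeholders would be written :1, :2, :3 instead of ?.

```python
import sqlite3

# Made-up rows standing in for the question's df_list: (date, GIORNO, MESE, ANNO).
df_list = [("2021-01-04", 4, 1, 2021), ("2021-01-05", 5, 1, 2021)]


def insert_tempo(connection, rows):
    """Insert only the three values that match the three placeholders."""
    sql = "INSERT INTO TEMPO (GIORNO, MESE, ANNO) VALUES (?, ?, ?)"
    params = [tuple(row[1:4]) for row in rows]  # drop the leading date field
    cursor = connection.cursor()
    cursor.executemany(sql, params)  # one call instead of a Python loop
    connection.commit()


connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE TEMPO (GIORNO INTEGER, MESE INTEGER, ANNO INTEGER)")
insert_tempo(connection, df_list)
print(connection.execute("SELECT * FROM TEMPO ORDER BY rowid").fetchall())
# [(4, 1, 2021), (5, 1, 2021)]
```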

write dataframe from jupyter notebook to snowflake without define table column type

I have a data frame in jupyter notebook. My objective is to import this df into snowflake as a new table.
Is there any way to write a new table into snowflake directly without defining any table columns' names and types?
I am using
import snowflake.connector as snow
from snowflake.connector.pandas_tools import write_pandas
from sqlalchemy import create_engine
import pandas as pd

connection = snow.connect(
    user='XXX',
    password='XXX',
    account='XXX',
    warehouse='COMPUTE_WH',
    database='SNOWPLOW',
    schema='DBT_WN'
)
df.to_sql('aaa', connection, index=False)
it ran into an error:
DatabaseError: Execution failed on sql 'SELECT name FROM sqlite_master WHERE type='table' AND name=?;': not all arguments converted during string formatting
Can anyone provide the sample code to fix this issue?
Here's one way to do it -- apologies in advance for my code formatting in SO combined with python's spaces vs tabs "model". Check the tabs/spaces if you cut-n-paste ...
Because of the Snowflake security model, be sure to specify the ROLE you are using in your connection parameters as well. (Often the default role is 'PUBLIC'.)
Since you already have SQLAlchemy in the mix, this idea doesn't use Snowflake's write_pandas, so it isn't a good answer for large dataframes. There are some odd behaviors with SQLAlchemy and Snowflake: make sure the dataframe column names are upper case, yet use a lowercase table name in the argument to to_sql().
def df2sf_alch(target_df, target_table):
    # create a SQLAlchemy engine that reuses the existing Snowflake connection;
    # replace the placeholder with your own Snowflake account URL
    engine = create_engine("snowflake://<your-sf-account-url>",
                           creator=lambda: connection)
    # re/create table in Snowflake
    try:
        # SQLAlchemy creates the table based on a lower-case table name
        # and it works to have uppercase df column names
        target_df.to_sql(target_table.lower(), con=engine, if_exists='replace', index=False)
        print(f"Table {target_table.upper()} re/created")
    except Exception as e:
        # report the failure together with the exception message
        print(f"Could not replace table {target_table.upper()}: {e}")
    nrows = connection.cursor().execute(f"select count(*) from {target_table}").fetchone()[0]
    print(f"Table {target_table.upper()} rows = {nrows}")
Note this function needs to be changed to reflect the appropriate 'snowflake account url' in order to create the sqlAlchemy connection object. Also, assuming the case naming oddities are taken care of in the df, along with your already defined connection, you'd call this function simply passing the df and the name of the table, like df2sf_alch(my_df, 'MY_TABLE')

Reading Data from Temp Table in Snowflake into Jupyter Notebook

I am trying to query data from Snowflake into a Jupyter Notebook. Since some columns were not present in the original table, I did create a temporary table which had the required new columns. Unfortunately, due to work restrictions, I couldn't show the whole output here. But when I did run the CREATE TEMPORARY TABLE command, got the following output.
Table CUSTOMER_ACCOUNT_NEW successfully created.
Here is the query I used to make the TEMP table.
CREATE OR REPLACE TEMPORARY TABLE DATA_LAKE.CUSTOMER.CUSTOMER_ACCOUNT_NEW AS
SELECT ID,
VERIFICATION_PROFILE,
get_path(VERIFICATION_PROFILE,'identityMindMostRecentResults')::VARCHAR AS identitymind,
get_path(VERIFICATION_PROFILE,'identityMindMostRecentResults."mm:1"')::VARCHAR AS mm1,
get_path(VERIFICATION_PROFILE,'identityMindMostRecentResults."mm:2"')::VARCHAR AS mm2,
get_path(VERIFICATION_PROFILE,'identityMindMostRecentResults.res')::VARCHAR AS res,
get_path(VERIFICATION_PROFILE,'identityMindMostRecentResults."ss:1"')::VARCHAR AS sanctions,
get_path(VERIFICATION_PROFILE,'autoVerified.facts.account.riskScore')::VARCHAR AS riskscore,
get_path(VERIFICATION_PROFILE,'autoVerified.facts.giact.verificationResponse')::VARCHAR AS GIACT,
get_path(VERIFICATION_PROFILE,'autoVerified.facts.account.type')::VARCHAR AS acct_type,
get_path(VERIFICATION_PROFILE,'autoVerified.verified')::VARCHAR AS verified,
get_path(VERIFICATION_PROFILE,'bankInformationProvided')::VARCHAR AS Bank_info_given,
get_path(VERIFICATION_PROFILE,'businessInformationProvided')::VARCHAR AS Business_info_given,
get_path(VERIFICATION_PROFILE,'autoVerified.facts.account.industry.riskLevel')::VARCHAR AS industry_risk
FROM DATA_LAKE.CUSTOMER.CUSTOMER_ACCOUNT
WHERE DATEDIFF('day',TO_DATE(TIME_UPDATED),CURRENT_DATE())<=90
I would like to mention that VERIFICATION_PROFILE is a JSON blob, hence I had to use get_path to retrieve the values. Moreover, keys like mm:1 are originally in double quotes, so I did use it as it is, and it is working fine in snowflake.
Then, using the Snowflake Connector for Python, I tried to run the following query:
import pandas as pd
import warnings
import snowflake.connector as sf
ctx = sf.connect(
    user='*****',
    password='*****',
    account='*******',
    warehouse='********',
    database='DATA_LAKE',
    schema='CUSTOMER'
)
#create cursor
curs = ctx.cursor()
sqlnew2 = "SELECT * \
FROM DATA_LAKE.CUSTOMER.CUSTOMER_ACCOUNT_NEW;"
curs.execute(sqlnew2)
df = curs.fetch_pandas_all()
Here curs is the cursor object created earlier. Then I got the following message;
ProgrammingError: 002003 (42S02): SQL compilation error:
Object 'DATA_LAKE.CUSTOMER.CUSTOMER_ACCOUNT_NEW' does not exist or not authorized.
Does the Snowflake connector allow us to query data from temporary tables or not? Any help or advice is greatly appreciated.
Temporary tables only live as long as the session in which they were created:
Temporary tables can have a Time Travel retention period of 1 day; however, a temporary table is purged once the session (in which the table was created) ends so the actual retention period is for 24 hours or the remainder of the session, whichever is shorter.
You might want to use a transient table instead:
https://docs.snowflake.com/en/user-guide/tables-temp-transient.html#comparison-of-table-types
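The session scoping can be illustrated by analogy with the standard-library sqlite3 module, where a temporary table is likewise visible only to the connection that created it. This is a sketch of the general behavior, not of Snowflake itself: the two sqlite3 connections stand in for two Snowflake sessions, and the function name is made up.

```python
import os
import sqlite3
import tempfile


def temp_table_visibility():
    """Create a temp table in one connection and look for it from another."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "demo.db")
        session_a = sqlite3.connect(path)  # stands in for the creating session
        session_b = sqlite3.connect(path)  # stands in for a later, separate session
        try:
            session_a.execute("CREATE TEMPORARY TABLE customer_account_new (id INTEGER)")
            session_a.execute("INSERT INTO customer_account_new VALUES (1)")
            session_a.commit()
            rows_in_a = session_a.execute(
                "SELECT COUNT(*) FROM customer_account_new"
            ).fetchone()[0]
            try:
                session_b.execute("SELECT COUNT(*) FROM customer_account_new")
                visible_in_b = True
            except sqlite3.OperationalError:
                # "no such table": the temp table belongs to session_a only
                visible_in_b = False
            return rows_in_a, visible_in_b
        finally:
            session_a.close()
            session_b.close()


print(temp_table_visibility())  # (1, False)
```

In the question, the CREATE TEMPORARY TABLE statement and the Python query ran in different sessions, which matches the second connection in this sketch. Creating the temporary table through the same connector connection that later queries it, or using a transient table, avoids the problem.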

Error trying to save new table to MySQL with sqlalchemy

Hi, does anybody have any troubleshooting ideas for this problem?
I have a standard python-sql connection at my local machine:
from sqlalchemy import create_engine
engine = create_engine("mysql+pymysql://root:*******@localhost/my_DB")
con = engine.connect()
This DB consists of 200+ tables where I store stock/market information, and I need to update it daily. To do that, I usually loop through all the tables and fetch up-to-date information from yahoo_finance using pandas datareader.
Once loaded into a new DF I use
df_new.to_sql(name = stock_ticker, con = con, if_exists = 'replace', index = False)
to save the new table into my DB.
The code above works just fine when I execute it one table at a time, but when I try to implement the same idea in a loop it just breaks, sometimes on the very first iteration:
for stock in Stocks:
    df_new = yahoo_quote(stock)
    df_new.to_sql(name = stock_ticker, con = con, if_exists = 'replace', index = False)
My first thought was that I was somehow exhausting my machine/SQL server with so many calls, so I tried adding a time.sleep(5) and made sure I erased all the information from memory on each iteration, but none of that seems to work. And, as I said, sometimes it breaks on the very first iteration.
By "break" I mean that it keeps running forever without saving the table. It usually takes a little less than 1 second to save a table, but when this happens I can leave it running for 10+ minutes and it still won't save it.
The if_exists='replace' option drops the table before inserting new values (see the pandas API reference).
Your code repeatedly drops and re-creates the same table in the loop.
If you want to replace all the data, pass if_exists='replace' on the first call to df_new.to_sql and if_exists='append' on the following calls.

pandas/sqlalchemy/pyodbc: Result object does not return rows from stored proc when UPDATE statement appears before SELECT

I'm using SQL Server 2014, pandas 0.23.4, sqlalchemy 1.2.11, pyodbc 4.0.24, and Python 3.7.0. I have a very simple stored procedure that performs an UPDATE on a table and then a SELECT on it:
CREATE PROCEDURE my_proc_1
    @v2 INT
AS
BEGIN
    UPDATE my_table_1
    SET v2 = @v2
    ;
    SELECT * from my_table_1
    ;
END
GO
This runs fine in MS SQL Server Management Studio. However, when I try to invoke it via Python using this code:
import pandas as pd
from sqlalchemy import create_engine

if __name__ == "__main__":
    conn_str = 'mssql+pyodbc://@MODEL_TESTING'
    engine = create_engine(conn_str)
    with engine.connect() as conn:
        df = pd.read_sql_query("EXEC my_proc_1 33", conn)
    print(df)
I get the following error:
sqlalchemy.exc.ResourceClosedError: This result object does not return
rows. It has been closed automatically.
(Please let me know if you want full stack trace, I will update if so)
When I remove the UPDATE from the stored proc, the code runs and the results are returned. Note also that selecting from a table other than the one being updated does not make a difference, I get the same error. Any help is much appreciated.
The issue is that the UPDATE statement is returning a row count, which is a scalar value, and the rows returned by the SELECT statement are "stuck" behind the row count where pyodbc cannot "see" them (without additional machinations).
It is considered a best practice to ensure that our stored procedures always start with a SET NOCOUNT ON; statement to suppress the returning of row count values from DML statements (UPDATE, DELETE, etc.) and allow the stored procedure to just return the rows from the SELECT statement.
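If the stored procedure cannot be changed, the "additional machinations" mentioned above can be sketched at the raw cursor level: skip every result that has no column description (such as the UPDATE row count) until a real row set appears. The helper name fetch_first_rowset is hypothetical, and FakeCursor is a stand-in written only so the sketch runs without a database. A pyodbc cursor offers the same description, nextset() and fetchall() members. Note that this applies to code that drives the cursor directly, not to pd.read_sql_query.

```python
def fetch_first_rowset(cursor):
    """Skip results without rows (e.g. row counts) and return the first row set."""
    while cursor.description is None:
        if not cursor.nextset():
            return []  # no result set with rows was produced
    return cursor.fetchall()


class FakeCursor:
    """Hypothetical stand-in for a DB-API cursor, used only to run the sketch."""

    def __init__(self, results):
        # Each result is (description, rows); description is None for a row count.
        self._results = list(results)
        self._pos = 0

    @property
    def description(self):
        return self._results[self._pos][0]

    def nextset(self):
        if self._pos + 1 >= len(self._results):
            return False
        self._pos += 1
        return True

    def fetchall(self):
        return list(self._results[self._pos][1])


# First result: the UPDATE row count. Second result: the SELECT rows.
cursor = FakeCursor([(None, []), ((("v1",), ("v2",)), [(1, 33), (2, 33)])])
print(fetch_first_rowset(cursor))  # [(1, 33), (2, 33)]
```

With a real connection, the call would follow cursor.execute("EXEC my_proc_1 33"). Adding SET NOCOUNT ON to the procedure remains the simpler fix.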
I got the same error for a different reason. I was using the newer SQLAlchemy select() syntax to fetch the entries of a table, and I had forgotten to pass the table class I wanted to select from. Adding the table class as an argument fixed the error.
The code that led to the error:
query = select().where(Assessment.created_by == assessment.created_by)
The fix is simply to add the table class name. Sometimes the issue is only in the syntax:
query = select(Assessment).where(Assessment.created_by == assessment.created_by)
