I'm trying to insert a pandas DataFrame into a Snowflake table using SQLAlchemy.
My DataFrame looks like this:
df =
FRUITS VEGETABLES
0 apple potato
1 banana onion
2 mango beans
My code:
import pandas as pd
import sqlalchemy
from snowflake.connector.pandas_tools import pd_writer
from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL
# account details
account_identifier = 'account_identifier'
user = 'user_login_name'
password = 'password'
database_name = 'database_name'
schema_name = 'schema_name'
conn_string = f"snowflake://{user}:{password}#{account_identifier}/{database_name}/{schema_name}"
engine = create_engine(conn_string)
table_name = 'my_table'
if_exists = 'append'
if __name__ == '__main__':
    df = pd.read_csv('my.csv')
    with engine.connect() as con:
        df.to_sql(name=table_name.lower(), con=con, if_exists=if_exists, index=False, method=pd_writer)
I'm getting an error:
snowflake.connector.errors.ProgrammingError: SQL compilation error: error line 1 at position 79
invalid identifier '"FRUITS"'
I don't understand why this raises an error even though my table schema has exactly these two columns.
Related
I am working with Python, trying to connect to Postgres. I created a table in my Postgres database, in the staging schema:
create table staging.data( Name varchar, Age bigint);
Then I try to connect and insert my DataFrame into this table:
import psycopg2
import pandas as pd
from sqlalchemy import create_engine
conn_string = 'postgresql://myuser:password@host/database_name'
db = create_engine(conn_string)
conn = db.connect()
# our dataframe
data = {'Name': ['Tom', 'dick', 'harry'],
        'Age': [22, 21, 24]}
# Create DataFrame
df = pd.DataFrame(data)
df.to_sql('staging.data', con=conn, if_exists='replace', index=False)
conn = psycopg2.connect(conn_string)
conn.autocommit = True
cursor = conn.cursor()
sql1 = '''select * from staging.data;'''
cursor.execute(sql1)
for i in cursor.fetchall():
print(i)
conn.commit()
conn.close()
But the Python script finishes with no error message, and there is no data in my table in Postgres.
Any idea about this?
Regards
I think the issue is that you are passing the schema as part of the table name. to_sql() treats 'staging.data' as a single, literal table name, so the data ends up in a table called "staging.data" in your default (public) schema rather than in staging.data. Pass the schema name separately via the schema argument of to_sql(), like this:
df.to_sql('data', con=conn, if_exists='replace', schema='staging', index=False)
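For completeness, here is a minimal sketch (reusing the connection string from the question, with @ in place of #) that shows where the rows actually went and cleans up the accidentally created table; the table and schema names are the ones from the question:
import pandas as pd
from sqlalchemy import create_engine, text
engine = create_engine('postgresql://myuser:password@host/database_name')
with engine.begin() as conn:
    # Show which schema each candidate table actually lives in
    tables = pd.read_sql(
        "SELECT table_schema, table_name "
        "FROM information_schema.tables "
        "WHERE table_name IN ('data', 'staging.data')",
        conn,
    )
    print(tables)
    # Drop the mistakenly created public."staging.data" table, if present
    conn.execute(text('DROP TABLE IF EXISTS public."staging.data"'))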
The code runs successfully with no errors returned, but only old records are displayed:
import pandas as pd
from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL
from config import config
engine = create_engine(URL(account=config.account,
                           user=config.username,
                           password=config.password,
                           warehouse=config.warehouse,
                           database=config.database,
                           schema=config.schema))
conn = engine.connect()
df = pd.DataFrame([('AAA', '1234'), ('BBB', '5678')], columns=['name', 'pswd'])
df.to_sql('demo_db.public.test_f1', con=engine, index=False, if_exists='append', index_label=None)
df = pd.read_sql_query('select * from demo_db.public.test_f1', conn)
print(df.head(5))
conn.close()
engine.dispose()
Please help!
It seems the 3-part name was treated as a single identifier, and the data was inserted into a table called "demo_db.public.test_f1":
SELECT * FROM demo_db.public."demo_db.public.test_f1";
The name should be provided as the table name only; the database and schema are inferred from the connection:
df.to_sql('test_f1', con=engine, index=False, if_exists='append', index_label=None)
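Putting that together with the code from the question, a minimal sketch (reusing the config object from the question) looks like this; if a different schema is needed, pandas' to_sql() accepts it through its separate schema argument rather than as part of the name:
import pandas as pd
from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL
from config import config
engine = create_engine(URL(account=config.account,
                           user=config.username,
                           password=config.password,
                           warehouse=config.warehouse,
                           database=config.database,
                           schema=config.schema))
df = pd.DataFrame([('AAA', '1234'), ('BBB', '5678')], columns=['name', 'pswd'])
# Plain table name: the database and schema come from the engine URL.
# To target a different schema, pass schema='...' to to_sql() instead of
# baking it into the table name.
df.to_sql('test_f1', con=engine, index=False, if_exists='append')
with engine.connect() as conn:
    print(pd.read_sql_query('select * from test_f1', conn).head())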
I'm moving data from Postgres to Snowflake. Originally it worked; however, I've added:
df_postgres["dateutc"]= pd.to_datetime(df_postgres["dateutc"])
because the date format was loading incorrectly into Snowflake, and now I see this error:
SQL compilation error: error line 1 at position 87 invalid identifier
'"dateutc"'
Here is my code:
from sqlalchemy import create_engine
import pandas as pd
import glob
import os
from config import postgres_user, postgres_pass, host,port, postgres_db, snow_user, snow_pass,snow_account,snow_warehouse
from snowflake.connector.pandas_tools import pd_writer
from snowflake.sqlalchemy import URL
from sqlalchemy.dialects import registry
registry.register('snowflake', 'snowflake.sqlalchemy', 'dialect')
engine = create_engine(f'postgresql://{postgres_user}:{postgres_pass}@{host}:{port}/{postgres_db}')
conn = engine.connect()
#reads query
df_postgres = pd.read_sql("SELECT * FROM rok.my_table", conn)
#dropping these columns
drop_cols=['RPM', 'RPT']
df_postgres.drop(drop_cols, inplace=True, axis=1)
#changed columns to lowercase
df_postgres.columns = df_postgres.columns.str.lower()
df_postgres["dateutc"]= pd.to_datetime(df_postgres["dateutc"])
print(df_postgres.dateutc.dtype)
sf_conn = create_engine(URL(
    account = snow_account,
    user = snow_user,
    password = snow_pass,
    database = 'test',
    schema = 'my_schema',
    warehouse = 'test',
    role = 'test',
))
df_postgres.to_sql(name='my_table',
                   index = False,
                   con = sf_conn,
                   if_exists = 'append',
                   chunksize = 300,
                   method = pd_writer)
Moving Ilja's answer from comment to answer for completeness:
Snowflake is case sensitive.
When writing "unquoted" SQL, Snowflake will convert table names and fields to uppercase.
This usually works, until someone decides to start quoting their identifiers in SQL.
pd_writer adds quotes to identifiers.
Hence, when you have df_postgres["dateutc"], it remains lowercase when it's transformed into a fully quoted query.
Writing df_postgres["DATEUTC"] in Python should fix the issue.
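A minimal sketch of that fix applied to the code in the question: uppercase the column names (instead of lowercasing them at the str.lower() step) so the identifiers pd_writer quotes match the uppercase column names Snowflake stores, and reference the uppercase name afterwards; everything else is assumed to be as posted:
# Uppercase all column names so the quoted identifiers match Snowflake's
df_postgres.columns = df_postgres.columns.str.upper()
df_postgres["DATEUTC"] = pd.to_datetime(df_postgres["DATEUTC"])
df_postgres.to_sql(name='my_table',
                   index = False,
                   con = sf_conn,
                   if_exists = 'append',
                   chunksize = 300,
                   method = pd_writer)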
I created a table by inserting data fetched from an API and stored in a pandas DataFrame, using SQLAlchemy.
I will need to query the API every 4 hours to get new data.
The problem is that the API will give me back not only the new data but also the old data that has already been imported into MySQL.
How can I import just the new data into the MySQL table?
So far I have retrieved the data from the API, stored it in a pandas DataFrame, created the connection to the MySQL database, and created a fresh new table.
import requests
import json
import pandas as pd
from pandas.io.json import json_normalize
myToken = 'xxx'
myUrl = 'somewebsite'
head = {'Authorization': 'token {}'.format(myToken)}
response = requests.get(myUrl, headers=head)
data=response.json()
#print(json.dumps(data, indent=4, sort_keys=True))
results=json_normalize(data['results'])
results.rename(columns={'datastream.name': 'datastream_name',
                        'datastream.url': 'datastream_url',
                        'datastream.datastream_type_id': 'datastream_id',
                        'start': 'error_date'}, inplace=True)
results_final = pd.DataFrame([results.datastream_name,
                              results.datastream_url,
                              results.error_date,
                              results.datastream_id,
                              results.message,
                              results.type_label]).transpose()
from sqlalchemy import create_engine
from sqlalchemy import exc
engine = create_engine('mysql://usr:psw@ip/schema')
con = engine.connect()
results_final.to_sql(name='error',con=con,if_exists='replace')
con.close()
The end goal is to insert into the table just the data coming from the API that does not already exist there.
You could pull the results already in the database into a new DataFrame and then compare the two DataFrames. After that, you would insert only the rows that are not already in the table. Not knowing the format of your table or data, I'm just using a generic SELECT statement here.
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy import exc
engine = create_engine('mysql://usr:psw@ip/schema')
con = engine.connect()
sql = "SELECT * FROM table_name"
old_results = pd.read_sql(sql, con)
df = pd.merge(old_results, results_final, how='outer', indicator=True)
new_results = df[df['_merge']=='right_only'][results_final.columns]
new_results.to_sql(name='error',con=con,if_exists='append')
con.close()
You also need to change if_exists to 'append', because when it is set to 'replace' it drops all the rows in the table and replaces them with the values from the pandas DataFrame.
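To illustrate the merge trick in isolation, here is a small self-contained sketch with made-up data (the column names are arbitrary):
import pandas as pd
# Rows already in the database table
old_results = pd.DataFrame({'id': [1, 2], 'message': ['a', 'b']})
# Rows returned by the API (old and new mixed together)
results_final = pd.DataFrame({'id': [1, 2, 3], 'message': ['a', 'b', 'c']})
# indicator=True adds a '_merge' column telling us where each row came from;
# 'right_only' marks rows that exist only in the API result, i.e. the new ones
merged = pd.merge(old_results, results_final, how='outer', indicator=True)
new_results = merged[merged['_merge'] == 'right_only'][results_final.columns]
print(new_results)   # only the row with id 3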
I developed this function to handle both cases: new values, and source and target tables whose columns are not equal.
import pandas as pd
from sqlalchemy import create_engine

def load_data(df):
    engine = create_engine('mysql+pymysql://root:pass@localhost/dw',
                           echo_pool=True, pool_size=10, max_overflow=20)
    with engine.connect() as conn, conn.begin():
        try:
            df_old = pd.read_sql('SELECT * FROM table', conn)
            # Check if there are new rows to be inserted
            if len(df) > len(df_old) or df.disconnected_time.max() > df_old.disconnected_time.max():
                print("There are new rows to be inserted.")
                df_merged = pd.merge(df_old, df, how='outer', indicator=True)
                df_final = df_merged[df_merged['_merge'] == 'right_only'][df.columns]
                df_final.to_sql(name='table', con=conn, index=False, if_exists='append')
        except Exception as err:
            print(str(err))
        else:
            # This handles the case where the source has more columns than the
            # target table: rebuild the target from the combined data
            if df.shape[1] > df_old.shape[1]:
                data = pd.read_sql('SELECT * FROM table', conn)
                df2 = pd.concat([df, data])
                df2.to_sql('table', conn, index=False, if_exists='replace')
            outcome = conn.execute("select count(1) from table")
            countRow = outcome.first()[0]
            return print(f"Total of {countRow} rows loaded.")
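As a usage sketch, assuming a hypothetical fetch_results() wrapper around the API/json_normalize steps from the question (in practice a cron job or scheduler is a better fit than a sleep loop):
import time
while True:
    results_final = fetch_results()  # hypothetical: re-runs the API call and json_normalize steps
    load_data(results_final)
    time.sleep(4 * 60 * 60)          # the question polls the API every 4 hours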
I have a database that contains multiple tables, and I am trying to import each table as a pandas dataframe. I can do this for a single table as follows:
import pandas as pd
import pandas.io.sql as psql
import pypyodbc
conn = pypyodbc.connect("DRIVER={SQL Server};\
SERVER=serveraddress;\
UID=uid;\
PWD=pwd;\
DATABASE=db")
df1 = psql.read_sql('SELECT * FROM dbo.table1', conn)
The number of tables in the database will change, and at any time I would like to be able to import each table into its own dataframe. How can I get all of these tables into pandas?
Depending on your SQL server, you can list the tables in a database by querying its metadata (for SQL Server, information_schema.tables).
For example:
tables_df = pd.read_sql("SELECT table_name FROM information_schema.tables WHERE table_type = 'BASE TABLE'", conn)
Now that your table names are accessible as a pandas DataFrame, you just need to loop over them:
table_name_list = tables_df.table_name
select_template = 'SELECT * FROM {table_name}'
frames_dict = {}
for tname in table_name_list:
    query = select_template.format(table_name = tname)
    frames_dict[tname] = pd.read_sql(query, conn)
Your dictionary frames_dict now contains all the DataFrames, with the table name as the key.
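A quick way to check what was loaded, for example:
# One DataFrame per table; print each table's name and shape
for tname, frame in frames_dict.items():
    print(tname, frame.shape)
# Access a single table by name, e.g. the dbo.table1 from the question
df1 = frames_dict['table1']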