How to raise an exception when a column name is too long in Postgres + SQLAlchemy? - python

I have a script in Python that uploads files to a Postgres database server. These files are then converted to SQL tables. For this, I'm using the SQLAlchemy library.
The problem arises when the column names are too long. I don't want Postgres to silently truncate column names that exceed the maximum identifier length (if I recall correctly, it's 63 in Postgres). The tables end up having columns with unintelligible names, and I would rather have the script cancel the upload.
The obvious solution is to just hardcode the maximum length in my script and raise an exception if someone tries to upload a table with column names that are too long. Nevertheless, I think this should be configurable in SQLAlchemy, since it already raises an exception when, for example, the table name is already in use in the database.
Extract from my script to upload table:
from sqlalchemy import (
    create_engine,
)
import pandas as pd

DB_CONFIG_DICT = {
    'user': "user",
    'host': "urlforhost.com",
    'port': 5432,
    'password': "password"
}
DB_CONN_FORMAT = "postgresql+psycopg2://{user}:{password}@{host}:{port}/{database}"
DB_CONN_URI_DEFAULT = DB_CONN_FORMAT.format(database='sandbox', **DB_CONFIG_DICT)
engine = create_engine(DB_CONN_URI_DEFAULT)

path = "file.csv"
table_name = "table_name"

df = pd.read_csv(path, decimal=r".")
df.columns = [c.lower() for c in df.columns]  # Postgres doesn't like capitals or spaces
df.to_sql(table_name, engine)

I hope this can help you.
def check_column_name(name):
    if len(name) > 63:
        raise ValueError("column name (%s) is too long" % name)

df.columns = [c.lower() for c in df.columns]
for name in df.columns:
    check_column_name(name)  # check each column name before the import
df.to_sql(table_name, engine)
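If you would rather not hardcode the 63-character limit, SQLAlchemy's PostgreSQL dialect exposes it on the engine, so a minimal sketch along these lines (reusing the engine, df and table_name from above) keeps the check in sync with the dialect:
max_len = engine.dialect.max_identifier_length  # 63 for PostgreSQL

too_long = [c for c in df.columns if len(c) > max_len]
if too_long:
    raise ValueError("column names exceed %d characters: %s" % (max_len, ", ".join(too_long)))
df.to_sql(table_name, engine)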

Related

Snowflake: SQL compilation error: error line invalid identifier '"dateutc"'

I'm moving data from Postgres to Snowflake. Originally it worked; however, I've added:
df_postgres["dateutc"]= pd.to_datetime(df_postgres["dateutc"])
because the date format was loading incorrectly into Snowflake, and now I see this error:
SQL compilation error: error line 1 at position 87 invalid identifier
'"dateutc"'
Here is my code:
from sqlalchemy import create_engine
import pandas as pd
import glob
import os
from config import postgres_user, postgres_pass, host, port, postgres_db, snow_user, snow_pass, snow_account, snow_warehouse
from snowflake.connector.pandas_tools import pd_writer
from snowflake.sqlalchemy import URL
from sqlalchemy.dialects import registry

registry.register('snowflake', 'snowflake.sqlalchemy', 'dialect')

engine = create_engine(f'postgresql://{postgres_user}:{postgres_pass}@{host}:{port}/{postgres_db}')
conn = engine.connect()

# read the query into a dataframe
df_postgres = pd.read_sql("SELECT * FROM rok.my_table", conn)

# drop these columns
drop_cols = ['RPM', 'RPT']
df_postgres.drop(drop_cols, inplace=True, axis=1)

# change column names to lowercase
df_postgres.columns = df_postgres.columns.str.lower()

df_postgres["dateutc"] = pd.to_datetime(df_postgres["dateutc"])
print(df_postgres.dateutc.dtype)

sf_conn = create_engine(URL(
    account=snow_account,
    user=snow_user,
    password=snow_pass,
    database='test',
    schema='my_schema',
    warehouse='test',
    role='test',
))

df_postgres.to_sql(name='my_table',
                   index=False,
                   con=sf_conn,
                   if_exists='append',
                   chunksize=300,
                   method=pd_writer)
Moving Ilja's answer from comment to answer for completeness:
Snowflake is case sensitive.
When writing "unquoted" SQL, Snowflake will convert table names and fields to uppercase.
This usually works, until someone decides to start quoting their identifiers in SQL.
pd_writer adds quotes to identifiers.
Hence when you have df_postgres["dateutc"] it remains lowercase when it is transformed into a fully quoted query.
Writing df_postgres["DATEUTC"] in Python should fix the issue.
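If renaming individual columns by hand is a nuisance, a minimal variation (assuming the same df_postgres and sf_conn as above) is to uppercase every column name just before the write, so the identifiers that pd_writer quotes match Snowflake's default uppercase resolution:
# Uppercase all column names so the quoted identifiers emitted by pd_writer
# line up with Snowflake's unquoted (uppercase) identifiers.
df_postgres.columns = df_postgres.columns.str.upper()

df_postgres.to_sql(name='my_table',
                   index=False,
                   con=sf_conn,
                   if_exists='append',
                   chunksize=300,
                   method=pd_writer)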

Sqlalchemy: add into mysql table new rows from pandas dataframe, if they don't exist already in the table

I created a table by inserting data fetched from an API and stored in a pandas dataframe, using SQLAlchemy.
I need to query the API every 4 hours to get new data.
The problem is that the API gives me back not only the new data but also the old data that is already imported into MySQL.
How can I import just the new data into the MySQL table?
So far, I retrieved the data from the API, stored it in a pandas object, created the connection to the MySQL db, and created a fresh new table.
import requests
import json
import pandas as pd
from pandas.io.json import json_normalize

myToken = 'xxx'
myUrl = 'somewebsite'
head = {'Authorization': 'token {}'.format(myToken)}
response = requests.get(myUrl, headers=head)
data = response.json()
#print(json.dumps(data, indent=4, sort_keys=True))

results = json_normalize(data['results'])
results.rename(columns={'datastream.name': 'datastream_name',
                        'datastream.url': 'datastream_url',
                        'datastream.datastream_type_id': 'datastream_id',
                        'start': 'error_date'}, inplace=True)
results_final = pd.DataFrame([results.datastream_name,
                              results.datastream_url,
                              results.error_date,
                              results.datastream_id,
                              results.message,
                              results.type_label]).transpose()

from sqlalchemy import create_engine
from sqlalchemy import exc

engine = create_engine('mysql://usr:psw@ip/schema')
con = engine.connect()
results_final.to_sql(name='error', con=con, if_exists='replace')
con.close()
The end goal is to insert into the table only the data from the API that does not already exist there.
You could pull the results already in the database into a new dataframe and then compare the two dataframes. After that, you would only insert the rows that are not already in the table. Not knowing the format of your table or data, I'm just using a generic SELECT statement here.
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy import exc

engine = create_engine('mysql://usr:psw@ip/schema')
con = engine.connect()

sql = "SELECT * FROM table_name"
old_results = pd.read_sql(sql, con)

# Outer merge with an indicator column, then keep only the rows that
# exist in results_final but not yet in the database.
df = pd.merge(old_results, results_final, how='outer', indicator=True)
new_results = df[df['_merge'] == 'right_only'][results_final.columns]

new_results.to_sql(name='error', con=con, if_exists='append')
con.close()
You also need to change if_exists to 'append', because when it is set to 'replace' it drops all values in the table and replaces them with the values in the pandas dataframe.
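To make the indicator trick concrete, here is a tiny, self-contained illustration (the column names are invented for the example) of how 'right_only' flags the rows that exist only in the freshly fetched dataframe:
import pandas as pd

old = pd.DataFrame({'id': [1, 2], 'msg': ['a', 'b']})   # rows already in the database
new = pd.DataFrame({'id': [2, 3], 'msg': ['b', 'c']})   # rows returned by the API

merged = pd.merge(old, new, how='outer', indicator=True)
to_insert = merged[merged['_merge'] == 'right_only'][new.columns]
print(to_insert)   # only the row with id 3, the one not yet in the database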
I developed this function to handle both: new values and the case where the columns of the source table and the target table are not equal.
import pandas as pd
from sqlalchemy import create_engine

def load_data(df):
    engine = create_engine('mysql+pymysql://root:pass@localhost/dw', echo_pool=True, pool_size=10, max_overflow=20)
    with engine.connect() as conn, conn.begin():
        try:
            df_old = pd.read_sql('SELECT * FROM table', conn)
            # Check whether there are new rows to be inserted
            if len(df) > len(df_old) or df.disconnected_time.max() > df_old.disconnected_time.max():
                print("There are new rows to be inserted.")
                df_merged = pd.merge(df_old, df, how='outer', indicator=True)
                df_final = df_merged[df_merged['_merge'] == 'right_only'][df.columns]
                df_final.to_sql(name='table', con=conn, index=False, if_exists='append')
        except Exception as err:
            print(str(err))
        else:
            # This handles the case where the number of source columns differs from the target
            if df.shape[1] > df_old.shape[1]:
                data = pd.read_sql('SELECT * FROM table', conn)
                df2 = pd.concat([df, data])
                df2.to_sql('table', conn, index=False, if_exists='replace')
        outcome = conn.execute("select count(1) from table")
        countRow = outcome.first()[0]
        return print(f"Total of {countRow} rows loaded.")
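For example, assuming the dataframe fetched from the API has the columns this function compares on (disconnected_time here belongs to this author's own data), each scheduled run would simply call:
load_data(results_final)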

Pandas dataframe to PostgreSQL table using psycopg2 without SQLAlchemy?

I'd like to write a Pandas dataframe to PostgreSQL table without using SQLAlchemy.
The table name should correspond to the pandas variable name, or the table should be replaced if it already exists. Data types need to match as well.
I'd like to avoid SQLAlchemy's to_sql function for several reasons.
import pandas as pd
from getpass import getpass
import psycopg2

your_pass = getpass(prompt='Password: ', stream=None)
conn_cred = {
    'host': your_host,
    'port': your_port,
    'dbname': your_dbname,
    'user': your_user,
    'password': your_pass
}
conn = psycopg2.connect(**conn_cred)
conn.autocommit = True

my_data = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

def store_dataframe_to_postgre(df, schema, active_conn):
    # df = pandas dataframe to store as a table
    # schema = schema for the table
    # active_conn = open connection to a PostgreSQL db
    # ...
    # Bonus: require explicit commit here, even though conn.autocommit = True

store_dataframe_to_postgre(my_data, 'my_schema', conn)
This should be the result in the Postgres db:
SELECT * FROM my_schema.my_data;
col1 col2
1 3
2 4
You can try copy_from for this. It expects a file-like object rather than a dataframe, so write the dataframe to an in-memory CSV buffer first:
import io

buffer = io.StringIO()
df.to_csv(buffer, index=False, header=False)
buffer.seek(0)
cursor = conn.cursor()
cursor.copy_from(buffer, 'my_data', sep=',', null='', columns=list(df.columns))
Reference code:
copy dataframe to postgres table with column that has defalut value
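Since the question also asks for the table to be created (or replaced) with matching data types, here is a rough sketch of what the body of store_dataframe_to_postgre could look like. It is only an illustration under simplifying assumptions: a small dtype-to-Postgres-type mapping, the table name passed in explicitly (deriving it from the Python variable name is not reliable), and copy_expert used so the schema-qualified table can be targeted:
import io

def store_dataframe_to_postgre(df, schema, active_conn, table_name='my_data'):
    # Map common pandas dtypes to Postgres column types; everything else becomes text.
    type_map = {'int64': 'bigint', 'float64': 'double precision',
                'bool': 'boolean', 'datetime64[ns]': 'timestamp'}
    cols = ', '.join('{} {}'.format(c, type_map.get(str(t), 'text'))
                     for c, t in df.dtypes.items())

    cur = active_conn.cursor()
    # Replace the table if it already exists, with columns matching the dataframe.
    cur.execute('DROP TABLE IF EXISTS {}.{}'.format(schema, table_name))
    cur.execute('CREATE TABLE {}.{} ({})'.format(schema, table_name, cols))

    # Stream the rows in via COPY from an in-memory CSV buffer.
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=False)
    buf.seek(0)
    cur.copy_expert('COPY {}.{} FROM STDIN WITH (FORMAT csv)'.format(schema, table_name), buf)
    active_conn.commit()
Calling store_dataframe_to_postgre(my_data, 'my_schema', conn) would then produce the SELECT output shown above.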

Writing dataframe to postgres database

I want to write a pandas dataframe to a postgres table. I make a connection to db as follows:
import psycopg2
import pandas as pd
import sqlalchemy
def connect(user, password, db, host='localhost', port=5432):
    '''Returns a connection and a metadata object'''
    url = 'postgresql://{}:{}@{}:{}/{}'
    url = url.format(user, password, host, port, db)
    # The return value of create_engine() is our connection object
    con = sqlalchemy.create_engine(url, client_encoding='utf8')
    # We then bind the connection to MetaData()
    meta = sqlalchemy.MetaData(bind=con, reflect=True)
    return con, meta

con, meta = connect('user_name', 'password', 'db_name', host='host_name')
When I read from a table that is already populated, it works fine:
df = pd.read_sql("SELECT * FROM db.table_name limit 10",con=con)
print df
I would like to be able to write df to a table. To test this, I have a temporary table called 'test' with two fields name and age.
# create a temp df
table = [['name', 'age'], ['nameA', 20], ['nameB', 30]]
headers = table.pop(0)
df = pd.DataFrame(table, columns=headers)

# write to db
df.to_sql('db.test', con, if_exists='replace', index=False)
I then check if the temp table is populated:
df = pd.read_sql("SELECT * FROM db.test limit 10",con=con)
print df
I get an empty dataframe! I got no errors when I used df.to_sql, but nothing is getting written to the database. What am I missing and how do I go about fixing this?
Versions:
Pandas: 0.19.2
SQLAlchemy: 1.1.10
Postgres: 9.4.9
I have not figured out why df.to_sql did not write to the table. Writing to table using pd.io.sql.SQLDatabase worked for my test case:
meta = sqlalchemy.MetaData(con, schema='db_name')
meta.reflect()
pdsql = pd.io.sql.SQLDatabase(con, meta=meta)
pdsql.to_sql(df, 'test', if_exists='replace')
I would not consider this THE solution; I'd be happy to accept a better solution or an answer that brings closure to why df.to_sql() does not behave as expected.
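One likely explanation, offered here as a guess rather than a confirmed diagnosis: pandas treats 'db.test' as a single table name rather than as schema db plus table test, so the rows land in a table literally named "db.test" in the default schema. to_sql takes a separate schema argument, so a sketch like the following may be what was intended:
# Pass the schema separately instead of embedding it in the table name.
df.to_sql('test', con, schema='db', if_exists='replace', index=False)

# The read-back should then find the rows.
df = pd.read_sql("SELECT * FROM db.test limit 10", con=con)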

How to copy entire SQL Server table into CSV including column headers?

Summary
I have a Python program (2.7) that connects to a SQL Server database using SQLAlchemy. I want to copy the entire SQL table into a local CSV file (including column headers). I'm new to SQLAlchemy (version 0.7), and so far I'm able to dump the entire CSV file, but I have to explicitly list my column headers.
Question
How do I copy an entire SQL table into a local CSV file (including column headers)? I don't want to explicitly type in my column headers. The reason is that I want to avoid changing the code if there are changes in the table's columns.
Code
import csv
import sqlalchemy

# Setup connection info, assume database connection info is correct
SQLALCHEMY_CONNECTION = (DB_DRIVER_SQLALCHEMY + '://'
                         + DB_UID + ":" + DB_PWD + "@" + DB_SERVER + "/" + DB_DATABASE)
engine = sqlalchemy.create_engine(SQLALCHEMY_CONNECTION, echo=True)
metadata = sqlalchemy.MetaData(bind=engine)
vw_AllCenterChatOverview = sqlalchemy.Table(
    'vw_AllCenterChatOverview', metadata, autoload=True)
metadata.create_all(engine)
conn = engine.connect()

# Run the SQL Select Statement
result = conn.execute("""SELECT * FROM
    [LifelineChatDB].[dbo].[vw_AllCenterChatOverview]""")

# Open file 'output.csv' and write SQL query contents to it
f = csv.writer(open('output.csv', 'wb'))
f.writerow(['StartTime', 'EndTime', 'Type', 'ChatName', 'Queue', 'Account',
            'Operator', 'Accepted', 'WaitTimeSeconds', 'PreChatSurveySkipped',
            'TotalTimeInQ', 'CrisisCenterKey'])  # Where I explicitly list table headers
for row in result:
    try:
        f.writerow(row)
    except UnicodeError:
        print "Error running this line ", row
result.close()
Table Structure
In my example, 'vw_AllCenterChatOverview' is the table. Here's the Table Headers:
StartTime, EndTime, Type, ChatName, Queue, Account, Operator, Accepted, WaitTimeSeconds, PreChatSurveySkipped, TotalTimeInQ, CrisisCenterKey
Thanks in advance!
Use ResultProxy.keys:
# Run the SQL Select Statement
result = conn.execute("""SELECT * FROM
    [LifelineChatDB].[dbo].[vw_AllCenterChatOverview]""")

# Get column names
column_names = result.keys()
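A sketch of how that slots into the loop from the question (assuming the same conn and csv setup), so the header row is no longer hard-coded:
import csv

result = conn.execute("""SELECT * FROM
    [LifelineChatDB].[dbo].[vw_AllCenterChatOverview]""")

f = csv.writer(open('output.csv', 'wb'))
f.writerow(result.keys())  # write the column names returned by the query
for row in result:
    f.writerow(row)
result.close()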
