I am using pymssql and the Pandas sql package to load data from SQL into a Pandas dataframe with frame_query.
I would like to send it back to the SQL database using write_frame, but I haven't been able to find much documentation on this. In particular, there is a parameter flavor='sqlite'. Does this mean that so far Pandas can only export to SQLite? My firm is using MS SQL Server 2008 so I need to export to that.
Unfortunately, yes. At the moment sqlite is the only "flavor" supported by write_frame. See https://github.com/pydata/pandas/blob/master/pandas/io/sql.py#L155
def write_frame(frame, name=None, con=None, flavor='sqlite'):
Write records stored in a DataFrame to SQLite. The index will currently be
if flavor == 'sqlite':
schema = get_sqlite_schema(frame, name)
raise NotImplementedError
Writing a simple write_frame should be fairly easy, though. For example, something like this might work (untested!):
import pymssql
conn = pymssql.connect(host='SQL01', user='user', password='password', database='mydatabase')
cur = conn.cursor()
# frame is your dataframe
wildcards = ','.join(['?'] * len(frame.columns))
data = [tuple(x) for x in frame.values]
table_name = 'Table'
cur.executemany("INSERT INTO %s VALUES(%s)" % (table_name, wildcards), data)
conn.commit()  # pymssql does not autocommit by default
To save some time for anyone else who tries to use this: it turns out that the line
wildcards = ','.join(['?'] * len(frame.columns))
should be:
wildcards = ','.join(['%s'] * len(frame.columns))
Hope that helps
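Putting the two posts above together, here is a minimal sketch of a generic insert helper. The function name insert_rows, the demo table and the placeholder parameter are made up for illustration. The standard-library sqlite3 module stands in for pymssql so that the example runs anywhere; sqlite3 uses '?' as its placeholder, whereas with pymssql you would pass '%s'.

```python
import sqlite3


def insert_rows(conn, table_name, columns, rows, placeholder="%s"):
    """Insert rows with executemany, using the driver's placeholder style.

    Hypothetical helper: table_name and columns are interpolated into the
    SQL text, so they must come from trusted code, never from user input.
    """
    wildcards = ",".join([placeholder] * len(columns))
    column_list = ",".join(columns)
    sql = "INSERT INTO %s (%s) VALUES (%s)" % (table_name, column_list, wildcards)
    cur = conn.cursor()
    cur.executemany(sql, rows)  # the driver substitutes the values safely
    conn.commit()
    return sql


# Demo with sqlite3 standing in for the real server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (a INTEGER, b TEXT)")
sql = insert_rows(conn, "demo", ["a", "b"], [(1, "x"), (2, "y")], placeholder="?")
stored = conn.execute("SELECT a, b FROM demo ORDER BY a").fetchall()
```

With a dataframe, the rows argument could be built as in the answer above, i.e. [tuple(x) for x in frame.values], and columns as list(frame.columns).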
I am new to working with SQL and Postgres specifically and am trying to write a simple program that stores a course id and some URLs in an SQL table with two columns. I am using the psycopg2 python library.
I am able to read from the table using:
def get_course_urls(course):
con = open_db_connection()
cur = con.cursor()
query = f"SELECT urls FROM courses WHERE course = '{course}'"
rows = cur.fetchall()
urls = []
for url in rows:
return urls
However, I am unable to insert into the table using:
def format_urls_string(urls):
return '{"' + '","'.join(urls) + '"}'
def add_course_urls(course, urls):
con = open_db_connection()
cur = con.cursor()
query = f"INSERT INTO courses (course, urls) VALUES ('{course}', '{format_urls_string(urls)}');"
add_course_urls("CS136", ["http://google.com", "http://wikipedia.com"])
I do not think anything is wrong with my query because when I run the same query in the SQL Shell it works as I want it to.
The locks on the columns say that the columns are READ-ONLY, however, I am able to insert through the shell. I feel like this is a very minor fix but since I am new to PostgreSQL, I am having some trouble.
Your help is appreciated!
This is the danger of doing the substitution yourself, instead of letting the db connector do it. You looked at your string, yes? You're writing
... VALUES ('CS136', '['http://google.com','http://wikipedia.com']')
which is obviously the wrong syntax. It needs to be
... VALUES ('CS136', '{"http://google.com","http://wikipedia.com"}')
which Python's formatter won't generate. So, you can either format the insertion string by hand, or put placeholders and pass the parameters to the cursor.execute call:
query = "INSERT INTO courses (course, urls) VALUES (%s,%s);"
cur.execute( query, (course, urls) )
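For completeness, here is a rough sketch of how the whole insert function might look with placeholders. It assumes con is an already open psycopg2 connection (so nothing is imported here), and it adds an explicit commit: psycopg2 does not autocommit by default, which is one common reason why an INSERT appears to work in the shell but not from Python.

```python
def add_course_urls(con, course, urls):
    """Insert one course row; `con` is assumed to be an open psycopg2 connection."""
    query = "INSERT INTO courses (course, urls) VALUES (%s, %s);"
    cur = con.cursor()
    try:
        # psycopg2 adapts a Python list to a PostgreSQL array, so no
        # hand-built '{"a","b"}' string is needed.
        cur.execute(query, (course, list(urls)))
        con.commit()  # without a commit the INSERT is discarded on close
    finally:
        cur.close()
```

Usage would then be something like add_course_urls(open_db_connection(), "CS136", ["http://google.com", "http://wikipedia.com"]).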
I have a sqlite db in my home dir.
stephen@stephen-AO725:~$ pwd
stephen@stephen-AO725:~$ sqlite db1
SQLite version 2.8.17
Enter ".help" for instructions
sqlite> select * from test
...> ;
sqlite> .quit
When I try to connect from a Jupyter notebook with sqlalchemy and pandas, something does not work.
pd.read_sql('select * from db1.test',db)
~/anaconda3/lib/python3.7/site-packages/sqlalchemy/engine/default.py in do_execute(self, cursor, statement, parameters, context)
579 def do_execute(self, cursor, statement, parameters, context=None):
--> 580 cursor.execute(statement, parameters)
582 def do_execute_no_params(self, cursor, statement, context=None):
DatabaseError: (sqlite3.DatabaseError) file is not a database
[SQL: select * from db1.test]
(Background on this error at: http://sqlalche.me/e/4xp6)
I also tried:
same result
Just to complete @Stephen's code with the required modules:
# 1.-Load module
import sqlalchemy
import pandas as pd
#2.-Turn on database engine
dbEngine=sqlalchemy.create_engine('sqlite:////home/stephen/db1.db') # ensure this is the correct path for the sqlite file.
#3.- Read data with pandas
pd.read_sql('select * from test',dbEngine)
#4.- I also want to add a new table from a dataframe in sqlite (a small one)
df_todb.to_sql(name = 'newTable',con= dbEngine, index=False, if_exists='replace')
Another way to read is to use the sqlite3 library, which may be more straightforward:
#1. - Load libraries
import sqlite3
import pandas as pd
# 2.- Create your connection (sqlite3 takes a plain file path, not a SQLAlchemy URL).
cnx = sqlite3.connect('/home/stephen/db1.db')
cursor = cnx.cursor()
# 3.- Query and print all the tables in the database
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
print(cursor.fetchall())
# 4.- Read a table into a dataframe
dfN_check = pd.read_sql_query("SELECT * FROM test", cnx) # we need the real name of the table
# 5.- Now I want to delete all rows of this table
cnx.execute("DELETE FROM test;")
# 6.- COMMIT CHANGES! (mandatory if you want to save these changes in the database)
cnx.commit()
# 7.- Close the connection with the database
cnx.close()
Please let me know if this helps!
import sqlalchemy
Note that you need three slashes in sqlite:/// in order to use a relative path for the DB. If you want an absolute path, use four slashes: sqlite:////
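To make the slash rule concrete, here is a tiny sketch. The helper name sqlite_url is made up; the point is only that the URL is always the prefix sqlite:/// followed by the path, so an absolute POSIX path (which already starts with a slash) ends up with four slashes.

```python
def sqlite_url(path):
    """Build a SQLAlchemy-style SQLite URL from a file path (hypothetical helper)."""
    return "sqlite:///" + path


relative = sqlite_url("db1")                # file in the current working directory
absolute = sqlite_url("/home/stephen/db1")  # absolute path adds the fourth slash
```

Either string can then be passed to sqlalchemy.create_engine.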
The issue is the lack of backward compatibility, as noted by Everila. Anaconda installs its own sqlite, which is sqlite 3.x, and that sqlite cannot load databases created by sqlite 2.x.
After creating a db with sqlite 3, the code works fine:
pd.read_sql('select * from test',db)
which confirms the 4 slashes are needed.
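If you are not sure which sqlite version created a file, one quick check is to look at the file header: SQLite 3 database files begin with the 16 bytes "SQLite format 3" followed by a NUL byte. The sketch below is only an illustration; the helper name is_sqlite3_file and the temporary files are made up.

```python
import os
import sqlite3
import tempfile

SQLITE3_MAGIC = b"SQLite format 3\x00"  # first 16 bytes of every SQLite 3 file


def is_sqlite3_file(path):
    """Return True if the file starts with the SQLite 3 header (hypothetical helper)."""
    with open(path, "rb") as fh:
        return fh.read(16) == SQLITE3_MAGIC


# Demo: one real SQLite 3 database and one file that is something else.
tmpdir = tempfile.mkdtemp()
good = os.path.join(tmpdir, "good.db")
conn = sqlite3.connect(good)
conn.execute("CREATE TABLE test (x INTEGER)")
conn.commit()
conn.close()

bad = os.path.join(tmpdir, "bad.db")
with open(bad, "wb") as fh:
    fh.write(b"not an sqlite 3 file")

good_ok = is_sqlite3_file(good)
bad_ok = is_sqlite3_file(bad)
```

A file for which this returns False will typically produce the "file is not a database" error when opened through Python's sqlite3 module.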
None of the sqlalchemy solutions worked for me with Python 3.10.6 and sqlalchemy 2.0.0b4; it could be a beta issue, or version 2.0.0 changed things. @corina-roca's solution was close, but not quite right: you need to pass a connection object, not an engine object. That's what the documentation says, but it didn't actually work. After a bit of experimentation, I discovered that engine.raw_connection() works, although you get a warning on the CLI. Here are my working examples:
The sqlite one works out of the box - but it's not ideal if you are thinking of changing databases later
import sqlite3
import pandas as pd
conn = sqlite3.connect("/home/stephen/db1")  # sqlite3 takes a plain file path, not a SQLAlchemy URL
df = pd.read_sql_query('SELECT * FROM test', conn)
# works, no problem
sqlalchemy lets you abstract your db away
from sqlalchemy import create_engine, text
engine = create_engine("sqlite:////home/stephen/db1")
conn = engine.connect() # <- this is also what you are supposed to
# pass to pandas... it doesn't work
result = conn.execute(text("select * from test"))
for row in result:
    print(row)  # outside pandas, this works - proving that
                # the connection is established
conn = engine.raw_connection() # with this workaround, it works; but you
# get a warning UserWarning: pandas only
# supports SQLAlchemy connectable ...
df = pd.read_sql_query(sql='SELECT * FROM test', con=conn)
I'm currently trying to query a deltadna database. Their Direct SQL Access guide states that any PostgreSQL ODBC compliant tools should be able to connect without issue. Using the guide, I set up an ODBC data source in windows
I have tried adding Set nocount on, changed various formats for the connection string, changed the table name to be (account).(system).(tablename), all to no avail. The simple query works in Excel and I have cross referenced with how Excel formats everything as well, so it is all the more strange that I get the no query problem.
import pyodbc
conn_str = 'DSN=name'
query1 = 'select eventName from table_name limit 5'
conn = pyodbc.connect(conn_str)
query1_cursor = conn.cursor().execute(query1)
row = query1_cursor.fetchone()
Result is ProgrammingError: No results. Previous SQL was not a query.
Try it like this:
import pyodbc
conn_str = 'DSN=name'
query1 = 'select eventName from table_name limit 5'
conn = pyodbc.connect(conn_str)
query1_cursor = conn.cursor()
query1_cursor.execute(query1)
row = query1_cursor.fetchone()
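Declare the cursor and execute the query in separate statements, so that query1_cursor is unambiguously the cursor object and fetchone() is called on it only after execute() has run. Note that pyodbc's execute() returns the cursor itself, so the chained form is normally valid too; if the error persists, the driver may not be reporting the statement as a query.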
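A generic way to check whether a statement actually produced a result set before fetching is to look at cursor.description, which the Python DB-API defines as None for statements that return no rows. The sketch below uses the standard-library sqlite3 module as a stand-in for pyodbc (an assumption, so that it runs without a DSN); the events table and its contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE events (eventName TEXT)")

# A statement that is not a query leaves description set to None.
cur.execute("INSERT INTO events VALUES ('login')")
insert_has_results = cur.description is not None

# A SELECT fills in description, so fetching is safe.
cur.execute("SELECT eventName FROM events LIMIT 5")
select_has_results = cur.description is not None
row = cur.fetchone() if select_has_results else None
```

If description is None right after your SELECT, the driver did not treat the statement as a query, and the problem lies in the statement or the driver rather than in how the cursor variable was assigned.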
I am using pyhive to interact with hive.
The SELECT statement works fine using the code below.
# Import hive module and connect
from pyhive import hive
conn = hive.Connection(host="HOST")
cur = conn.cursor()
# Import pandas
import pandas as pd
# Store select query in dataframe
all_tables = pd.read_sql("SELECT * FROM table LIMIT 5", conn)
print(all_tables)
# Using cursor
cur = conn.cursor()
cur.execute('SELECT * FROM table LIMIT 5')
print(cur.fetchall())
Up to here there is no problem. The problem appears when I want to INSERT into Hive.
Let's say I want to execute this query: INSERT INTO table2 SELECT Col1, Col2 FROM table1;
I tried:
cur.execute('INSERT INTO table2 SELECT Col1, Col2 FROM table1')
I receive this error:
pyhive.exc.OperationalError: TExecuteStatementResp(status=TStatus(errorCode=1, errorMessage=u'Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask', sqlState=u'08S01', infoMessages=[u'*org.apache.hive.service.cli.HiveSQLException:Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.tez.TezTask:28:27', u'org.apache.hive.service.cli.operation.Operation:toSQLException:Operation.java:388', u'org.apache.hive.service.cli.operation.SQLOperation:runQuery:SQLOperation.java:244', u'org.apache.hive.service.cli.operation.SQLOperation:runInternal:SQLOperation.java:279', u'org.apache.hive.service.cli.operation.Operation:run:Operation.java:324', u'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatementInternal:HiveSessionImpl.java:499', u'org.apache.hive.service.cli.session.HiveSessionImpl:executeStatement:HiveSessionImpl.java:475', u'sun.reflect.GeneratedMethodAccessor81:invoke::-1', u'sun.reflect.DelegatingMethodAccessorImpl:invoke:DelegatingMethodAccessorImpl.java:43', u'java.lang.reflect.Method:invoke:Method.java:498', u'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:78', u'org.apache.hive.service.cli.session.HiveSessionProxy:access$000:HiveSessionProxy.java:36', u'org.apache.hive.service.cli.session.HiveSessionProxy$1:run:HiveSessionProxy.java:63', u'java.security.AccessController:doPrivileged:AccessController.java:-2', u'javax.security.auth.Subject:doAs:Subject.java:422', u'org.apache.hadoop.security.UserGroupInformation:doAs:UserGroupInformation.java:1698', u'org.apache.hive.service.cli.session.HiveSessionProxy:invoke:HiveSessionProxy.java:59', u'com.sun.proxy.$Proxy33:executeStatement::-1', u'org.apache.hive.service.cli.CLIService:executeStatement:CLIService.java:270', u'org.apache.hive.service.cli.thrift.ThriftCLIService:ExecuteStatement:ThriftCLIService.java:507', 
u'org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1437', u'org.apache.hive.service.rpc.thrift.TCLIService$Processor$ExecuteStatement:getResult:TCLIService.java:1422', u'org.apache.thrift.ProcessFunction:process:ProcessFunction.java:39', u'org.apache.thrift.TBaseProcessor:process:TBaseProcessor.java:39', u'org.apache.hive.service.auth.TSetIpAddressProcessor:process:TSetIpAddressProcessor.java:56', u'org.apache.thrift.server.TThreadPoolServer$WorkerProcess:run:TThreadPoolServer.java:286', u'java.util.concurrent.ThreadPoolExecutor:runWorker:ThreadPoolExecutor.java:1149', u'java.util.concurrent.ThreadPoolExecutor$Worker:run:ThreadPoolExecutor.java:624', u'java.lang.Thread:run:Thread.java:748'], statusCode=3), operationHandle=None)
If I execute the same query in Hive directly, everything runs well.
Any thoughts?
NB: All my tables are external
CREATE EXTERNAL TABLE IF NOT EXISTS table ( col1 String, col2 String) stored as orc LOCATION 's3://somewhere' tblproperties ("orc.compress"="SNAPPY");
The solution was to add the username in the connection line; conn = hive.Connection(host="HOST", username="USER")
From what I understand, Hive queries are divided into several types of operations (jobs). A simple query (e.g. SELECT * FROM table) just reads the data using the Hive metastore, so no MapReduce job or temporary tables are needed to perform it. But as soon as you switch to more complicated queries (e.g. using JOINs), you end up with the same error.
The final code looks like this:
# Import hive module and connect
from pyhive import hive
conn = hive.Connection(host="HOST", username="USER")
cur = conn.cursor()
query = "INSERT INTO table2 SELECT Col1, Col2 FROM table1"
cur.execute(query)
So maybe it needs permission or something.. I will search more about this behavior and update the answer.
I'm not sure how to insert a pandas df using pyhive, but if you have pyspark installed, one option is that you could convert to a spark df and use pyspark to do it.
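from pyspark.sql import SparkSession
# SparkSession is the current entry point (pyspark.sql has no importable sqlContext)
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark_df = spark.createDataFrame(pandas_df)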
You can do the following using spark.
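from pyspark.sql import SparkSession
# SparkSession is the current entry point; Hive support is needed to write to Hive tables
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
# convert the pandas data frame to spark data frame
spark_df = spark.createDataFrame(pandas_df)
# register the spark data frame as temp table
spark_df.createOrReplaceTempView("my_temp_table")
# execute insert statement using spark sql
spark.sql("insert into hive_table select * from my_temp_table")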
This will insert data in your data frame to a hive table.
Hope this helps you
I have a dictionary with 3 keys which correspond to field names in a SQL Server table. The values of these keys come from an excel file and I store this dictionary in a dataframe which I now need to insert into a SQL table. This can all be seen in the code below:
import pandas as pd
import pymssql
fp = "file path"
data = pd.read_excel(fp, sheet_name="CRM View")  # 'sheetname' was renamed to 'sheet_name' in newer pandas
row_date = data.loc[3, ]
row_sita = "ABZPD"
row_event = data.iloc[12, :]
df = pd.DataFrame({'date': row_date,
'sita': row_sita,
'event': row_event
}, index=None)
df = df[4:]
df = df.fillna("")
My question is how do I insert this dictionary into a SQL table now?
Also, as a side note, this code is part of a loop which needs to go through several excel files one by one, insert the data into dictionary then into SQL then delete the data in the dictionary and start again with the next excel file.
You could try something like this:
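import MySQLdb
# connect (host, user, password, database)
conn = MySQLdb.connect("", "username", "password", "database")
x = conn.cursor()
# write: pass the values as parameters instead of formatting them into the string
# (table_name is a placeholder for your table)
x.execute('INSERT INTO table_name (row_date, sita, event) VALUES (%s, %s, %s)', (row_date, sita, event))
conn.commit()
# close
conn.close()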
You might have to change it a little based on your SQL restrictions, but should give you a good start anyway.
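Since the question starts from a dictionary of columns, here is a small sketch of turning such a dictionary into rows and inserting them in one call. The sample values and the crm table are made up, and the standard-library sqlite3 module stands in for the real server so that the example runs anywhere; with pymssql or MySQLdb the placeholders would be %s instead of ?.

```python
import sqlite3

# Columns as they might come out of one Excel file (values are invented).
columns = {
    "date": ["2018-01-01", "2018-01-02"],
    "sita": ["ABZPD", "ABZPD"],
    "event": ["Easter", ""],
}

# Turn the column-oriented dictionary into one tuple per row.
rows = list(zip(columns["date"], columns["sita"], columns["event"]))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crm (date TEXT, sita TEXT, event TEXT)")
conn.executemany("INSERT INTO crm (date, sita, event) VALUES (?, ?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM crm").fetchone()[0]
```

Inside a loop over several files, the same three steps (build rows, executemany, commit) would simply be repeated per file, with the dictionary rebuilt each time.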
For the pandas dataframe, you can use the pandas built-in method to_sql to store in db. Following is the way to use it.
import urllib.parse
import sqlalchemy as sa
params = urllib.parse.quote_plus("DRIVER={};SERVER={};DATABASE={};Trusted_Connection=True;".format("{SQL Server}",
conn_str = 'mssql+pyodbc:///?odbc_connect={}'.format(params)
engine = sa.create_engine(conn_str)
df.to_sql(<table_name>, engine,schema=<schema_name>, if_exists="append", index=False)
For this method you will need to install the sqlalchemy package.
pip install sqlalchemy
You will also need to set up the MSSQL DSN on the machine.
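As a rough illustration of how the connection URL is put together on Python 3, the sketch below builds only the string (no database is contacted). The helper name mssql_odbc_url and the server and database names SQL01 and mydatabase are placeholders, not values from the answer above.

```python
from urllib.parse import quote_plus


def mssql_odbc_url(server, database, driver="{SQL Server}"):
    """Build an mssql+pyodbc URL from a raw ODBC connection string (hypothetical helper)."""
    odbc = "DRIVER={};SERVER={};DATABASE={};Trusted_Connection=yes;".format(
        driver, server, database
    )
    # The ODBC string must be URL-encoded before it is embedded in the URL.
    return "mssql+pyodbc:///?odbc_connect={}".format(quote_plus(odbc))


url = mssql_odbc_url("SQL01", "mydatabase")
```

The resulting url can then be handed to sa.create_engine, and the engine to df.to_sql as shown above.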