I have created a sqlite database using pandas df.to_sql; however, accessing it seems considerably slower than just reading in the 500 MB CSV file.
I need to:
1. set the primary key for each table using the df.to_sql method
2. tell the sqlite database what datatype each of the columns in my dataframe is; can I pass a list like [integer,integer,text,text]?
code.... (format code button not working)
if ext == ".csv":
    df = pd.read_csv("/Users/data/" + filename)
    columns = df.columns
    columns = [i.replace(' ', '_') for i in columns]
    df.columns = columns
    df.to_sql(name,con,flavor='sqlite',schema=None,if_exists='replace',index=True,index_label=None, chunksize=None, dtype=None)
Unfortunately there is no way right now to set a primary key in the pandas df.to_sql() method. Additionally, just to make things more of a pain there is no way to set a primary key on a column in sqlite after a table has been created.
However, a work around at the moment is to create the table in sqlite with the pandas df.to_sql() method. Then you could create a duplicate table and set your primary key followed by copying your data over. Then drop your old table to clean up.
It would be something along the lines of this.
import pandas as pd
import sqlite3
df = pd.read_csv("/Users/data/" +filename)
columns = df.columns
columns = [i.replace(' ', '_') for i in columns]
#write the pandas dataframe to a sqlite table
df.columns = columns
df.to_sql(name,con,flavor='sqlite',schema=None,if_exists='replace',index=True,index_label=None, chunksize=None, dtype=None)
#connect to the database
conn = sqlite3.connect('database')
c = conn.cursor()
c.executescript('''
PRAGMA foreign_keys=off;
BEGIN TRANSACTION;
ALTER TABLE table RENAME TO old_table;
/*create a new table with the same column names and types while
defining a primary key for the desired column*/
CREATE TABLE new_table (col_1 TEXT PRIMARY KEY NOT NULL,
col_2 TEXT);
INSERT INTO new_table SELECT * FROM old_table;
DROP TABLE old_table;
COMMIT TRANSACTION;
PRAGMA foreign_keys=on;''')
#close out the connection
c.close()
conn.close()
In the past I have done this as I have faced this issue. Just wrapped the whole thing as a function to make it more convenient...
In my limited experience with sqlite I have found that not being able to add a primary key after a table has been created, not being able to perform Update Inserts or UPSERTS, and UPDATE JOIN has caused a lot of frustration and some unconventional workarounds.
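For what it's worth, SQLite releases from 3.24.0 onward accept an upsert clause (INSERT ... ON CONFLICT ... DO UPDATE), so that particular workaround may not be needed on a recent build. A minimal sketch, assuming a hypothetical `inventory` table and made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; the conflict target must have a PRIMARY KEY or UNIQUE constraint.
conn.execute("CREATE TABLE inventory (item TEXT PRIMARY KEY, qty INTEGER)")
conn.execute("INSERT INTO inventory VALUES ('bolt', 10)")

# Requires the underlying SQLite library to be 3.24.0 or newer.
upsert_sql = """
INSERT INTO inventory (item, qty) VALUES (?, ?)
ON CONFLICT(item) DO UPDATE SET qty = excluded.qty
"""
# 'bolt' already exists and is updated; 'nut' is new and is inserted.
conn.executemany(upsert_sql, [("bolt", 25), ("nut", 7)])
conn.commit()

rows = conn.execute("SELECT item, qty FROM inventory ORDER BY item").fetchall()
print(rows)
```

The version bundled with Python can be checked via sqlite3.sqlite_version.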
Lastly, in the pandas df.to_sql() method there is a dtype keyword argument that can take a dictionary of column names:types, e.g. dtype = {col_1: TEXT}
Building on Chris Guarino's answer, here's some functions that provide a more general solution. See the example at the bottom for how to use them.
import re
def get_create_table_string(tablename, connection):
    sql = """
    select * from sqlite_master where name = "{}" and type = "table"
    """.format(tablename)
    result = connection.execute(sql)
    create_table_string = result.fetchmany()[0][4]
    return create_table_string
def add_pk_to_create_table_string(create_table_string, colname):
    regex = "(\n.+{}[^,]+)(,)".format(colname)
    return re.sub(regex, "\\1 PRIMARY KEY,", create_table_string, count=1)
def add_pk_to_sqlite_table(tablename, index_column, connection):
    cts = get_create_table_string(tablename, connection)
    cts = add_pk_to_create_table_string(cts, index_column)
    template = """
    BEGIN TRANSACTION;
        ALTER TABLE {tablename} RENAME TO {tablename}_old_;
        {cts};
        INSERT INTO {tablename} SELECT * FROM {tablename}_old_;
        DROP TABLE {tablename}_old_;
    COMMIT TRANSACTION;
    """
    create_and_drop_sql = template.format(tablename = tablename, cts = cts)
    connection.executescript(create_and_drop_sql)
# Example:
# import pandas as pd
# import sqlite3
# df = pd.DataFrame({"a": [1,2,3], "b": [2,3,4]})
# con = sqlite3.connect("deleteme.db")
# df.to_sql("df", con, if_exists="replace")
# add_pk_to_sqlite_table("df", "index", con)
# r = con.execute("select sql from sqlite_master where name = 'df' and type = 'table'")
# print(r.fetchone()[0])
There is a gist of this code here
In pandas version 0.15, to_sql() got an argument dtype, which can be used to set both dtype and the primary key attribute for all columns:
import sqlite3
import pandas as pd
df = pd.DataFrame({'MyID': [1, 2, 3], 'Data': [3, 2, 6]})
with sqlite3.connect('foo.db') as con:
    df.to_sql('df', con=con, dtype={'MyID': 'INTEGER PRIMARY KEY',
                                    'Data': 'FLOAT'})
Building on Chris Guarino's answer, it is almost impossible to assign a primary key to an already existing column using the df.to_sql() method. Likewise, with your 500 MB CSV file, creating a duplicate table with a huge number of columns is impractical.
However, a small workaround is to add a new column as the primary key while writing the dataframe to SQL. You can iterate over the dataframe's columns to build the CREATE TABLE statement yourself and declare the primary key at that point. With this, a duplicate table is not needed.
I am adding a small code snippet for it.
import pandas as pd
import sqlite3
import sqlalchemy
from sqlalchemy import create_engine
df = pd.read_excel(r'C:\XXX\XXX\XXXX\XXX.xlsx')
X1 = df.iloc[0:,0:]
dataset = X1.astype('float32')
dataset['date'] = pd.date_range(start='1/1/2020', periods=len(dataset), freq='D')
dataset=dataset.set_index('date')
engine = create_engine('sqlite:///measurement.db')
sqlite_connection = engine.connect()
sqlite_table = "table1"
sqlite_connection.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY AUTOINCREMENT, date TIMESTAMP, " +
",".join(["%s REAL" % x for x in dataset.columns]) + ")" )
dataset.to_sql(sqlite_table, sqlite_connection, if_exists='append')
Output database table:
[(0, 'id', 'INTEGER', 0, None, 1),
(1, 'date', 'TIMESTAMP', 0, None, 0),
(2, 'time_stamp', 'REAL', 0, None, 0),
(3, 'column_1', 'REAL', 0, None, 0),
(4, 'column_2', 'REAL', 0, None, 0)]
This method works only if the dataframe has an index. Also, to have the index as a column in the table, it must be declared explicitly in the CREATE TABLE query.
Hope this helps for huge database creations.
In Sqlite, with a normal rowid table, unless the primary key is a single INTEGER column (See ROWIDs and the INTEGER PRIMARY KEY in the documentation), it's equivalent to a UNIQUE index (Because the real PK of a normal table is the rowid).
Notes from the documentation for rowid tables:
The PRIMARY KEY of a rowid table (if there is one) is usually not the true primary key for the table, in the sense that it is not the unique key used by the underlying B-tree storage engine. The exception to this rule is when the rowid table declares an INTEGER PRIMARY KEY. In the exception, the INTEGER PRIMARY KEY becomes an alias for the rowid.
The true primary key for a rowid table (the value that is used as the key to look up rows in the underlying B-tree storage engine) is the rowid.
The PRIMARY KEY constraint for a rowid table (as long as it is not the true primary key or INTEGER PRIMARY KEY) is really the same thing as a UNIQUE constraint. Because it is not a true primary key, columns of the PRIMARY KEY are allowed to be NULL, in violation of all SQL standards.
So you can easily fake a primary key after creating the table with:
CREATE UNIQUE INDEX mytable_fake_pk ON mytable(pk_column)
Besides the NULL thing, you won't get the benefits of an INTEGER PRIMARY KEY if your column is supposed to hold integers, like taking up less space and auto-generating values on insert if left out, but it'll otherwise work for most purposes.
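A minimal sketch of that behaviour, assuming a stand-in `mytable` with made-up rows (in practice the table would be the one to_sql created): the unique index rejects duplicate keys but still lets several NULLs through.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for a table that was created without any key.
conn.execute("CREATE TABLE mytable (pk_column TEXT, payload TEXT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?)", [("a", "x"), ("b", "y")])

# Add the "fake" primary key after the fact.
conn.execute("CREATE UNIQUE INDEX mytable_fake_pk ON mytable(pk_column)")

duplicate_rejected = False
try:
    conn.execute("INSERT INTO mytable VALUES ('a', 'z')")  # duplicate key
except sqlite3.IntegrityError:
    duplicate_rejected = True

# NULLs count as distinct in a UNIQUE index, so both of these succeed.
conn.execute("INSERT INTO mytable VALUES (NULL, 'n1')")
conn.execute("INSERT INTO mytable VALUES (NULL, 'n2')")
null_rows = conn.execute(
    "SELECT COUNT(*) FROM mytable WHERE pk_column IS NULL"
).fetchone()[0]

print(duplicate_rejected, null_rows)
```

If NULL keys are a concern, the column would also need a NOT NULL constraint, which has to be part of the CREATE TABLE statement.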
There is another option for getting pandas to create a primary key on table creation using some undocumented methods from the pandas internals (at your own risk). You can peruse the code here. The key is the keys param of SQLTable which is not exposed in the to_sql API.
Note that I reset_index and set index=False in the call to SQLTable to prevent a duplicate/unnecessary index from being created in addition to the primary key constraint.
from pandas.io.sql import SQLTable, pandasSQL_builder
df = <your dataframe>
engine = <sqlalchemy engine>
table = SQLTable(
    "my_table",
    pandasSQL_builder(engine, schema="my_schema"),
    frame=df.reset_index(),
    index=False,
    keys=df.index.names,
    if_exists=if_exists,
    schema="my_schema",
)
table.create() # Will honor your if_exists settings
table.insert(chunksize, method="multi") # This hits limits in allowed sqlite params if chunks are too large
There is also a get_schema function in that file that can get you a create table statement if you want to do something manually.
There's no way to do that. You can only set the primary key directly in the database after you move the data.
Related
I use the pandas method to_sql to append a DataFrame to some sqlite table. The sqlite table has a foreign key constraints on a column id_region that pandas should consider. The available values for id_region are 1, 2.
If the DataFrame contains a non-existing id_region value 3, I would expect to_sql to throw an exception.
However, the data is written to the database without exception and the foreign key constraint is ignored.
If I manually change the value in the sqlite database using Navicat, for example to 1 and then back to 3, I get the expected error.
=> The foreign key constraint in sqlite seems to work but not when inserting the data.
=> How can I tell pandas to consider the foreign key constraint?
Example code to reproduce the issue:
import sqlite3
import pandas as pd
file_path = 'demo.sqlite'
id_region = pd.DataFrame([
{'id': 1, 'label': 'foo'},
{'id': 2, 'label': 'baa'},
])
id_region.set_index(['id'], inplace=True)
data = pd.DataFrame([
{'id': 1, 'id_region': 3, 'value': 1}
])
data.set_index(['id'], inplace=True)
create_data_table_query = 'CREATE TABLE `data` (' +\
'id integer PRIMARY KEY NOT NULL, ' +\
'id_region integer NOT Null, ' +\
'value real NOT NULL, ' + \
'FOREIGN KEY (id_region) REFERENCES id_region(id)' + \
')'
with sqlite3.connect(file_path) as connection:
    id_region.to_sql('id_region', connection, index_label='id')
    cursor = connection.cursor()
    cursor.execute(create_data_table_query)
    data.to_sql('data', connection, index_label='id', if_exists='append')
Tables created by the above code:
id_region:
data, referencing id_region:
The foreign key constraints must explicitly be enabled for each connection in sqlite3 and pandas does not seem to have an option to do so automatically. The command 'PRAGMA foreign_keys = ON' has to be executed for the connection before it is passed to the to_sql command of pandas:
cursor = connection.cursor()
cursor.execute(create_data_table_query )
cursor.execute('PRAGMA foreign_keys = ON')
data.to_sql('data', connection, index_label='id', if_exists='append')
sqlite4 will have this enabled by default.
Also see
Does SQLite3 not support foreign key constraints?
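The effect of the pragma can be checked without pandas. A minimal sketch using the same two tables as the question, with a hypothetical helper `try_insert` that sets the pragma explicitly either way:

```python
import sqlite3

def try_insert(enable_fk):
    """Insert a row with a non-existing id_region and report what happened."""
    conn = sqlite3.connect(":memory:")
    # OFF is the usual default; it is set explicitly here so both cases are comparable.
    conn.execute("PRAGMA foreign_keys = %s" % ("ON" if enable_fk else "OFF"))
    conn.execute("CREATE TABLE id_region (id INTEGER PRIMARY KEY, label TEXT)")
    conn.execute(
        "CREATE TABLE data ("
        "id INTEGER PRIMARY KEY NOT NULL, "
        "id_region INTEGER NOT NULL, "
        "value REAL NOT NULL, "
        "FOREIGN KEY (id_region) REFERENCES id_region(id))"
    )
    conn.executemany("INSERT INTO id_region VALUES (?, ?)", [(1, "foo"), (2, "baa")])
    try:
        conn.execute("INSERT INTO data VALUES (1, 3, 1.0)")  # id_region 3 does not exist
        return "inserted"
    except sqlite3.IntegrityError:
        return "rejected"
    finally:
        conn.close()

print(try_insert(False), try_insert(True))
```

The pragma is per connection and has no effect while a transaction is open, so it is best issued right after connecting.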
How can I transfer data from one MySQL database to another? The other database may have different field names, except id, which will act as the primary key.
I have tried using SQLAlchemy, but the only data that gets mapped are the field names that are the same in both databases.
import pandas as pd
import sqlalchemy
db1 = sqlalchemy.create_engine("mysql+pymysql://root:@localhost:3306/mydatabase1")
db2 = sqlalchemy.create_engine("mysql+pymysql://root:@localhost:3306/nava")
print('Writing...')
query = ''' (SELECT * FROM customers1)'''
df = pd.read_sql(query, db1)
print(df)
#query1 = ''UPDATE 'leap' SET `leap`value '''
df.to_sql('nap', db2, index=False, if_exists='append')
I get an error that the other database doesn't have the same field names, but what I want is that even if the field names change, the data still gets mapped with reference to the primary key id.
This is the program that I asked about in the above question; there was an error, so the code hasn't appeared in the right way.
import pandas as pd
import sqlalchemy
db1 = sqlalchemy.create_engine("mysql+pymysql://root:@localhost:3306/mydatabase1")
db2 = sqlalchemy.create_engine("mysql+pymysql://root:@localhost:3306/nava")
print('Writing...')
query = ''' (SELECT * FROM customers1)'''
df = pd.read_sql(query, db1)
df.to_sql('nap', db2, index=False, if_exists='append')
I am attempting to query a subset of a MySql database table, feed the results into a Pandas DataFrame, alter some data, and then write the updated rows back to the same table. My table size is ~1MM rows, and the number of rows I will be altering will be relatively small (<50,000) so bringing back the entire table and performing a df.to_sql(tablename,engine, if_exists='replace') isn't a viable option. Is there a straightforward way to UPDATE the rows that have been altered without iterating over every row in the DataFrame?
I am aware of this project, which attempts to simulate an "upsert" workflow, but it seems it only accomplishes the task of inserting new non-duplicate rows rather than updating parts of existing rows:
GitHub Pandas-to_sql-upsert
Here is a skeleton of what I'm attempting to accomplish on a much larger scale:
import pandas as pd
from sqlalchemy import create_engine
import threading
#Get sample data
d = {'A' : [1, 2, 3, 4], 'B' : [4, 3, 2, 1]}
df = pd.DataFrame(d)
engine = create_engine(SQLALCHEMY_DATABASE_URI)
#Create a table with a unique constraint on A.
engine.execute("""DROP TABLE IF EXISTS test_upsert """)
engine.execute("""CREATE TABLE test_upsert (
A INTEGER,
B INTEGER,
PRIMARY KEY (A))
""")
#Insert data using pandas.to_sql
df.to_sql('test_upsert', engine, if_exists='append', index=False)
#Read the table back and alter row where 'A' == 2
df_in_db = pd.read_sql('SELECT * FROM test_upsert', engine)
df_in_db.loc[df_in_db['A'] == 2, 'B'] = 6
Now I would like to write df_in_db back to my 'test_upsert' table with the updated data reflected.
This SO question is very similar, and one of the comments recommends using an "sqlalchemy table class" to perform the task.
Update table using sqlalchemy table class
Can anyone expand on how I would implement this for my specific case above if that is the best (only?) way to implement it?
I think the easiest way would be to:
first delete those rows that are going to be "upserted". This can be done in a loop, but it's not very efficient for bigger data sets (5K+ rows), so I'd save this slice of the DF into a temporary MySQL table:
# assuming we have already changed values in the rows and saved those changed rows in a separate DF: `x`
x = df[mask]  # `mask` should help us to find changed rows...
# make sure `x` DF has a Primary Key column as index
x = x.set_index('a')
# dump a slice with changed rows to temporary MySQL table
x.to_sql('my_tmp', engine, if_exists='replace', index=True)
conn = engine.connect()
trans = conn.begin()
try:
    # delete those rows that we are going to "upsert"
    conn.execute('delete from test_upsert where a in (select a from my_tmp)')
    # insert changed rows on the same connection, so both steps share one transaction
    x.to_sql('test_upsert', conn, if_exists='append', index=True)
    trans.commit()
except:
    trans.rollback()
    raise
PS: I didn't test this code, so it might have some small bugs, but it should give you an idea...
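The same delete-then-insert pattern can be tried out without a MySQL server. A minimal sketch using the standard library's sqlite3 as a stand-in, with the `test_upsert` table from the question and made-up changed rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_upsert (A INTEGER PRIMARY KEY, B INTEGER)")
conn.executemany("INSERT INTO test_upsert VALUES (?, ?)",
                 [(1, 4), (2, 3), (3, 2), (4, 1)])
conn.commit()

# Hypothetical slice of changed rows: A=2 is modified, A=5 is new.
changed_rows = [(2, 6), (5, 0)]

# Staging table; TEMP tables disappear when the connection closes.
conn.execute("CREATE TEMP TABLE my_tmp (A INTEGER PRIMARY KEY, B INTEGER)")

with conn:  # commits on success, rolls back if any statement raises
    conn.executemany("INSERT INTO my_tmp VALUES (?, ?)", changed_rows)
    conn.execute("DELETE FROM test_upsert WHERE A IN (SELECT A FROM my_tmp)")
    conn.execute("INSERT INTO test_upsert SELECT A, B FROM my_tmp")

result = conn.execute("SELECT A, B FROM test_upsert ORDER BY A").fetchall()
print(result)
```

Keeping the delete and the insert inside one transaction means a failed insert does not leave the table with rows missing.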
A MySQL-specific solution using pandas' to_sql "method" arg and sqlalchemy's mysql insert on_duplicate_key_update features:
# assumption: the snippet refers to sqlalchemy under the alias `db`
import sqlalchemy as db
import sqlalchemy.dialects.mysql
def create_method(meta):
    def method(table, conn, keys, data_iter):
        sql_table = db.Table(table.name, meta, autoload=True)
        insert_stmt = db.dialects.mysql.insert(sql_table).values([dict(zip(keys, data)) for data in data_iter])
        upsert_stmt = insert_stmt.on_duplicate_key_update({x.name: x for x in insert_stmt.inserted})
        conn.execute(upsert_stmt)
    return method
engine = db.create_engine(...)
conn = engine.connect()
with conn.begin():
    meta = db.MetaData(conn)
    method = create_method(meta)
    df.to_sql(table_name, conn, if_exists='append', method=method)
Here is a general function that will update each row (but all values in the row simultaneously)
def update_table_from_df(df, table, where):
    '''Will take a dataframe and update each specified row in the SQL table
    with the DF values -- DF columns MUST match SQL columns
    WHERE statement should be triple-quoted string
    Will not update any columns contained in the WHERE statement'''
    update_string = f'UPDATE {table} set '
    for idx, row in df.iterrows():
        upstr = update_string
        for col in list(df.columns):
            if (col != 'datetime') & (col not in where):
                if col != df.columns[-1]:
                    # quote string values, leave everything else bare
                    if isinstance(row[col], str):
                        upstr += f'''{col} = '{row[col]}', '''
                    else:
                        upstr += f'''{col} = {row[col]}, '''
                else:
                    if isinstance(row[col], str):
                        upstr += f'''{col} = '{row[col]}' '''
                    else:
                        upstr += f'''{col} = {row[col]} '''
        upstr += where
        cursor.execute(upstr)
        cursor.commit()
I was struggling with this before and now I've found a way.
Basically, create a separate data frame in which you keep only the data that you have to update.
df  # dataframe holding the rows to update
s_update = ""  # string of update statements
# Loop through the data frame
for i in range(len(df)):
    s_update += "update your_table_name set column_name = '%s' where column_name = '%s';"%(df[col_name1][i], df[col_name2][i])
Now pass s_update to cursor.execute or engine.execute (wherever you execute SQL query)
This will update your data instantly.
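A variant of the same idea that avoids pasting values into the SQL string is to pass (new value, key) pairs to executemany with placeholders. A minimal sketch using sqlite3 as a stand-in, with hypothetical column names `id` and `price` (MySQL drivers generally use %s rather than ? as the placeholder):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table and columns, for illustration only.
conn.execute("CREATE TABLE your_table_name (id INTEGER PRIMARY KEY, price REAL)")
conn.executemany("INSERT INTO your_table_name VALUES (?, ?)",
                 [(1, 1.0), (2, 2.0), (3, 3.0)])
conn.commit()

# (new value, key) pairs; in practice these would be built from the
# rows of the data frame that holds the updates.
updates = [(9.5, 1), (7.25, 3)]

with conn:  # one transaction for all updates
    conn.executemany("UPDATE your_table_name SET price = ? WHERE id = ?", updates)

rows = conn.execute("SELECT id, price FROM your_table_name ORDER BY id").fetchall()
print(rows)
```

The driver handles quoting of each value, so strings containing quotes do not break the statement.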
Python Version - 2.7.6
Pandas Version - 0.17.1
MySQLdb Version - 1.2.5
In my database ( PRODUCT ) , I have a table ( XML_FEED ). The table XML_FEED is huge ( Millions of record )
I have a pandas.DataFrame() ( PROCESSED_DF ). The dataframe has thousands of rows.
Now I need to run this
REPLACE INTO TABLE PRODUCT.XML_FEED
(COL1, COL2, COL3, COL4, COL5),
VALUES (PROCESSED_DF.values)
Question:-
Is there a way to run REPLACE INTO TABLE in pandas? I already checked pandas.DataFrame.to_sql() but that is not what I need. I would prefer not to read the XML_FEED table into pandas because it is very large.
With the release of pandas 0.24.0, there is now an official way to achieve this by passing a custom insert method to the to_sql function.
I was able to achieve the behavior of REPLACE INTO by passing this callable to to_sql:
def mysql_replace_into(table, conn, keys, data_iter):
    from sqlalchemy.dialects.mysql import insert
    from sqlalchemy.ext.compiler import compiles
    from sqlalchemy.sql.expression import Insert
    @compiles(Insert)
    def replace_string(insert, compiler, **kw):
        s = compiler.visit_insert(insert, **kw)
        s = s.replace("INSERT INTO", "REPLACE INTO")
        return s
    data = [dict(zip(keys, row)) for row in data_iter]
    conn.execute(table.table.insert(replace_string=""), data)
You would pass it like so:
df.to_sql(db, if_exists='append', method=mysql_replace_into)
Alternatively, if you want the behavior of INSERT ... ON DUPLICATE KEY UPDATE ... instead, you can use this:
def mysql_replace_into(table, conn, keys, data_iter):
    from sqlalchemy.dialects.mysql import insert
    data = [dict(zip(keys, row)) for row in data_iter]
    stmt = insert(table.table).values(data)
    update_stmt = stmt.on_duplicate_key_update(**dict(zip(stmt.inserted.keys(),
                                                          stmt.inserted.values())))
    conn.execute(update_stmt)
Credits to https://stackoverflow.com/a/11762400/1919794 for the compile method.
As of this version (0.17.1), I am unable to find any direct way to do this in pandas. I filed a feature request for it.
I did this in my project with executing some queries using MySQLdb and then using DataFrame.to_sql(if_exists='append')
Suppose
1) product_id is my primary key in table PRODUCT
2) feed_id is my primary key in table XML_FEED.
SIMPLE VERSION
import MySQLdb
import sqlalchemy
import pandas
con = MySQLdb.connect('localhost','root','my_password', 'database_name')
con_str = 'mysql+mysqldb://root:my_password@localhost/database_name'
engine = sqlalchemy.create_engine(con_str) #because I am using mysql
df = pandas.read_sql('SELECT * from PRODUCT', con=engine)
df_product_id = df['product_id']
product_id_str = (str(list(df_product_id.values))).strip('[]')
delete_str = 'DELETE FROM XML_FEED WHERE feed_id IN ({0})'.format(product_id_str)
cur = con.cursor()
cur.execute(delete_str)
con.commit()
df.to_sql('XML_FEED', if_exists='append', con=engine)# you can use flavor='mysql' if you do not want to create sqlalchemy engine but it is deprecated
Please note:-
The REPLACE [INTO] syntax allows us to INSERT a row into a table, except that if a UNIQUE KEY (including PRIMARY KEY) violation occurs, the old row is deleted prior to the new INSERT, hence no violation.
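SQLite accepts the same REPLACE INTO syntax, so the semantics can be tried out locally. A minimal sketch using sqlite3 as a stand-in for MySQL, with a hypothetical `xml_feed` table and made-up rows in place of PROCESSED_DF:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xml_feed (feed_id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT)")
conn.executemany("INSERT INTO xml_feed VALUES (?, ?, ?)",
                 [(1, "old", "keep?"), (2, "b", "c")])
conn.commit()

# Rows as they might come out of the processed data frame (hypothetical values).
processed = [(1, "new", None), (3, "x", "y")]

with conn:
    conn.executemany(
        "REPLACE INTO xml_feed (feed_id, col1, col2) VALUES (?, ?, ?)",
        processed,
    )

feed = conn.execute("SELECT * FROM xml_feed ORDER BY feed_id").fetchall()
print(feed)
```

Row 1 is replaced as a whole, so its col2 ends up NULL rather than keeping the old value; that is the practical difference from an update of individual columns.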
I needed a generic solution to this problem, so I built on shiva's answer--maybe it will be helpful to others. This is useful in situations where you grab a table from a MySQL database (whole or filtered), update/add some rows, and want to perform a REPLACE INTO statement with df.to_sql().
It finds the table's primary keys, performs a delete statement on the MySQL table with all keys from the pandas dataframe, and then inserts the dataframe into the MySQL table.
def to_sql_update(df, engine, schema, table):
    df.reset_index(inplace=True)
    sql = ''' SELECT column_name from information_schema.columns
              WHERE table_schema = '{schema}' AND table_name = '{table}' AND
              COLUMN_KEY = 'PRI';
          '''.format(schema=schema, table=table)
    id_cols = [x[0] for x in engine.execute(sql).fetchall()]
    id_vals = [df[col_name].tolist() for col_name in id_cols]
    sql = ''' DELETE FROM {schema}.{table} WHERE 0 '''.format(schema=schema, table=table)
    for row in zip(*id_vals):
        sql_row = ' AND '.join([''' {}='{}' '''.format(n, v) for n, v in zip(id_cols, row)])
        sql += ' OR ({}) '.format(sql_row)
    engine.execute(sql)
    df.to_sql(table, engine, schema=schema, if_exists='append', index=False)
If you use to_sql, you can choose what happens when the table already exists. For a table named 'mydb' and a dataframe named 'df', you'd use:
df.to_sql('mydb', con, if_exists='replace')
Note that this drops the existing table and recreates it from the dataframe rather than replacing individual rows, so I am not 100% sure that's what you're looking for.
Normally, if I want to insert values into a table, I do something like this (assuming I know which columns the values belong to):
conn = sqlite3.connect('mydatabase.db')
conn.execute("INSERT INTO MYTABLE (ID,COLUMN1,COLUMN2)\
VALUES(?,?,?)",[myid,value1,value2])
But now I have a list of columns (the length of the list may vary) and a list of values, one for each column in the list.
For example, say I have a table with 10 columns (namely column1, column2, ..., column10) and a list of the columns I want to fill, say [column3,column4], together with a list of values for those columns: [value for column3,value for column4].
How do I insert the values in the list into the columns they belong to?
As far as I know the parameter list in conn.execute works only for values, so we have to use string formatting like this:
import sqlite3
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE t (a integer, b integer, c integer)')
col_names = ['a', 'b', 'c']
values = [0, 1, 2]
conn.execute('INSERT INTO t (%s, %s, %s) values(?,?,?)'%tuple(col_names), values)
Please note that this is risky, since strings interpolated into SQL should always be checked for injection attacks. You could, however, pass the list of column names through some validation function before insertion.
EDITED:
For lists of varying length you could try something like
exec_text = 'INSERT INTO t (' + ','.join(col_names) +') values(' + ','.join(['?'] * len(values)) + ')'
conn.execute(exec_text, values)
# as long as len(col_names) == len(values)
Of course string formatting will work, you just need to be a bit cleverer about it.
col_names = ','.join(col_list)
col_spaces = ','.join(['?'] * len(col_list))
sql = 'INSERT INTO t (%s) values(%s)' % (col_names, col_spaces)
conn.execute(sql, values)
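One way to do the check on column names mentioned above is to compare them against the table's actual columns before building the statement. A minimal sketch, assuming a hypothetical helper `insert_columns` (not part of sqlite3) and SQLite's PRAGMA table_info:

```python
import sqlite3

def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'

def insert_columns(conn, table, col_names, values):
    """Insert `values` into `col_names` of `table` after validating the names."""
    if len(col_names) != len(values):
        raise ValueError("col_names and values must have the same length")
    # table_info rows are (cid, name, type, notnull, dflt_value, pk)
    info = conn.execute("PRAGMA table_info(%s)" % quote_ident(table))
    known = {row[1] for row in info}
    unknown = [c for c in col_names if c not in known]
    if unknown:
        raise ValueError("unknown columns: %r" % unknown)
    cols = ",".join(quote_ident(c) for c in col_names)
    marks = ",".join("?" * len(values))
    sql = "INSERT INTO %s (%s) VALUES (%s)" % (quote_ident(table), cols, marks)
    conn.execute(sql, values)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a integer, b integer, c integer)")
insert_columns(conn, "t", ["b", "c"], [1, 2])
row = conn.execute("SELECT a, b, c FROM t").fetchone()
print(row)
```

Only the identifiers are formatted into the string; the values still go through the ? placeholders.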
I was looking for a solution to create columns based on a list of unknown / variable length and found this question. However, I managed to find a nicer solution (for me anyway), that's also a bit more modern, so thought I'd include it in case it helps someone:
import sqlite3
def create_sql_db(my_list):
    file = 'my_sql.db'
    table_name = 'table_1'
    init_col = 'id'
    col_type = 'TEXT'
    conn = sqlite3.connect(file)
    c = conn.cursor()
    # CREATE TABLE (IF IT DOESN'T ALREADY EXIST)
    c.execute('CREATE TABLE IF NOT EXISTS {tn} ({nf} {ft})'.format(
        tn=table_name, nf=init_col, ft=col_type))
    # CREATE A COLUMN FOR EACH ITEM IN THE LIST
    for new_column in my_list:
        c.execute('ALTER TABLE {tn} ADD COLUMN "{cn}" {ct}'.format(
            tn=table_name, cn=new_column, ct=col_type))
    conn.close()
my_list = ["Col1", "Col2", "Col3"]
create_sql_db(my_list)
All my data is of the type text, so I just have a single variable "col_type" - but you could for example feed in a list of tuples (or a tuple of tuples, if that's what you're into):
my_other_list = [("ColA", "TEXT"), ("ColB", "INTEGER"), ("ColC", "BLOB")]
and change the CREATE A COLUMN step to:
for tupl in my_other_list:
    new_column = tupl[0] # "ColA", "ColB", "ColC"
    col_type = tupl[1] # "TEXT", "INTEGER", "BLOB"
    c.execute('ALTER TABLE {tn} ADD COLUMN "{cn}" {ct}'.format(
        tn=table_name, cn=new_column, ct=col_type))
As a noob, I can't comment on the very succinct, updated solution @ron_g offered. While testing, though, I had to frequently delete the sample database itself, so for any other noobs using this to test, I would advise adding in:
c.execute('DROP TABLE IF EXISTS {tn}'.format(
    tn=table_name))
Prior to the 'CREATE TABLE ...' portion.
It appears there are multiple instances of
.format(
tn=table_name ....)
in both 'CREATE TABLE ...' and 'ALTER TABLE ...', so I am trying to figure out if it's possible to create a single instance (similar to, or included in, the def section).
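One possible way to avoid repeating the .format(...) call is a small closure that remembers the shared names. A minimal sketch, where `make_runner` is a hypothetical helper and the table and column names follow the example above:

```python
import sqlite3

def make_runner(cursor, **names):
    """Return a function that fills the shared names into each SQL template."""
    def run(template, **extra):
        # `extra` supplies the per-statement names such as the column name
        cursor.execute(template.format(**names, **extra))
    return run

conn = sqlite3.connect(":memory:")
c = conn.cursor()
run = make_runner(c, tn="table_1", ct="TEXT")

run("DROP TABLE IF EXISTS {tn}")
run("CREATE TABLE IF NOT EXISTS {tn} ({nf} {ct})", nf="id")
for new_column in ["Col1", "Col2", "Col3"]:
    run('ALTER TABLE {tn} ADD COLUMN "{cn}" {ct}', cn=new_column)

# table_info rows are (cid, name, type, notnull, dflt_value, pk)
columns = [row[1] for row in c.execute("PRAGMA table_info(table_1)")]
print(columns)
```

The table name and column type are given once, and each statement only passes what differs.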