I have a large (18GB) CSV file I am trying to load into a database through the use of a python script. My approach is broken down like so:
Load the file into a temporary table, this is so the file gets loaded without failing due to any duplicate issues or primary keys.
Copy the data from the table into a new table with indexes and Conflict handling for potential dupes.
My python code and the SQL command is as follows:
engine = sqlalchemy.create_engine('postgres://adatabasestring')
event.listen(engine, 'connect', init_search_path)
connection = engine.raw_connection()
cursor = connection.cursor()
file_path = "/path/to/file.csv"
print("COPYING file to temp table")
keyword_adding_sql = """
CREATE TEMP TABLE tmp_table
(LIKE working_table INCLUDING DEFAULTS)
ON COMMIT DROP;
COPY tmp_table FROM '{file_path}' DELIMITER ',' CSV HEADER;
INSERT INTO working_table
SELECT *
FROM tmp_table
ON CONFLICT DO NOTHING;
"""
cursor.execute(keyword_adding_sql.format(file_path=file_path))
connection.commit()
The script was ran from my local machine but the file it was copying and the database itself are on the remote server. I ran the script and left it running and came back to an error code saying the server closed unexpectedly. I was worried the transaction was cancelled or the like but I queried pg_stat_activity and the query itself is still listed as active.
Will the query ever be finished or is it hanging on itself? I checked the working_table size and it looks like all the data is there as expected.
Related
I think I've lost my mind. I have created a python script to read a temp table in SQL SSMS. While testing, we found out that python is able to query and read the table even when it's not there/queryable in SSMS. I believe the DF is storing in cache or something but let me break down the problem into steps:
Starting point, the temp table is present in SSMS, MAIN_DF = python.read_sql('SELECT Statement') and stored in DF and saved to excel file (using ExcelWriter)
We delete the temp table in SQL, then run the python script again. To make sure, we use THE SAME 'SELECT' statement found in the python script in SSMS and it displays 'Invalid object name' which is correct because the table has been dropped. BUT when I run the python script again, it is able to query the table and get the same results it had before! It should be throwing the same error as SSMS because the table isn't there! Why isn't python starting from scratch when I run it? It seems to be holding information over from the initial run. How do I ensure I am starting from scratch every time I run it?
I have tried many things including starting the script with blank DF's so they should not have anything held over. 'MAIN_DF = pd.DataFrame()'. I have tried deleting the DF's at the end as well. 'del MAIN_DF'
I don't understand what is happening..
try:
conn = pyodbc.connect(r'Driver={SQL Server};Server=GenericServername;Database=testdb;Trusted_Connection=yes;')
print('Connected to SQL: ' + str(datetime.now()))
MAIN_DF = pd.read_sql('SELECT statement',conn)
print('Queried Main DF: ' + str(datetime.now()))
It's because I didn't close the connection conn.close() so it was cached in the memory and SQL didn't perform / close it
Sorry about this unprofessional question but I'm kinda new to sqlite but I was wondering if there's any way I can open two files in same python command db = sqlite3.connect('./cogs/database/users.sqlite')
when I open this in my command it doesn't allow me to do same thing in the same command to open another file so for example
open db = sqlite3.connect('./cogs/database/users.sqlite') and read something from it if so
open db = sqlite3.connect('./cogs/database/anotherfile.sqlite') and insert to it
but it always accepts first file only and ignore second file
Assign db1 so it connects to users.sqlite,
and db2 so it connects to anotherfile.sqlite.
Then you can e.g. SELECT from one
and INSERT into the other,
with a temp var bridging the two.
Sqlite databases are single file based, so no - sqlite3.connect builds a connection object to a single data base file.
Even if you build two connection objects, you can't execute queries across them.
If you really need the data from two files at a time you need to merge that data into one data - or don't use sqlite.
You can execute queries across two SQLite files, but you will need to execute an ATTACH command on the first connection cursor.
conn = sqlite3.connect("users.sqlite")
cur = conn.cursor()
cmd = "ATTACH DATABASE 'anotherfile.sqlite' AS otra"
try:
cur.execute(cmd)
query = """
SELECT
t1.Id, t1.Name, t2.Address
FROM personnel t1
LEFT JOIN otra.location t2
ON t2.PersonId = t1.Id
WHERE t1.Status = 'current'
ORDER BY t1.Name;
"""
cur.execute(query)
rows = cur.fetchall()
except sqlite3.Error as err:
do_something_with(err)
I use python (my IDE is pycharm) and new to SQlite. I read that I must use commit in order to save the data or changes, otherwise non of those would be saved to the table. I use a simple code to create a table in a database without using commit, define the headers and close the database file. Using DB_Browser I then open the file and see it is updated to what I have just made. Then my question is why do I need the commit command ?
import sqlite3
from sqlite3 import Error
# Connecting SQLite to the Database
def create_connection(db_file):
""" create a database connection to a SQLite database """
try:
# Creates or opens a file called mydb with a SQLite3 DB
db = sqlite3.connect(db_file)
# Get a cursor object
cursor = db.cursor()
# Check if table users does not exist and create it
cursor.execute('''CREATE TABLE IF NOT EXISTS
users(id INTEGER PRIMARY KEY, name TEXT, phone TEXT, email TEXT unique, password TEXT)''')
except Error as e:
# Roll back any change if something goes wrong
db.rollback()
raise e
finally:
# Close the db connection
db.close()
fname = "mydb.db"
create_connection(fname)
commit()
This method commits the current transaction. If you don’t call this method, anything you did since the last call to commit() is not visible from other database connections. If you wonder why you don’t see the data you’ve written to the database, please check you didn’t forget to call this method
https://docs.python.org/2/library/sqlite3.html
Kindly go through the documentation you'll find answer 90% of the time.
Apparently, from this link
By default, SQLite is in auto-commit mode.
Thanks to Richard for pointing this link.
I am trying to open a .sqlite3 file in python but I see no information is returned. So I tried r and still get empty for tables. I would like to know what tables are in this file.
I used the following code for python:
import sqlite3
from sqlite3 import Error
def create_connection(db_file):
""" create a database connection to the SQLite database
specified by the db_file
:param db_file: database file
:return: Connection object or None
"""
try:
conn = sqlite3.connect(db_file)
return conn
except Error as e:
print(e)
return None
database = "D:\\...\assignee.sqlite3"
conn = create_connection(database)
cur = conn.cursor()
rows = cur.fetchall()
but rows are empty!
This is where I got the assignee.sqlite3 from:
https://github.com/funginstitute/downloads
I also tried RStudio, below is the code and results:
> con <- dbConnect(drv=RSQLite::SQLite(), dbname="D:/.../assignee")
> tables <- dbListTables(con)
But this is what I get
first make sure you provided correct path on your connection string to the sql
light db ,
use this conn = sqlite3.connect("C:\users\guest\desktop\example.db")
also make sure you are using the SQLite library in the unit tests and the production code
check the types of sqllite connection strings and determain which one your db belongs to :
Basic
Data Source=c:\mydb.db;Version=3;
Version 2 is not supported by this class library.
SQLite
In-Memory Database
An SQLite database is normally stored on disk but the database can also be
stored in memory. Read more about SQLite in-memory databases.
Data Source=:memory:;Version=3;New=True;
SQLite
Using UTF16
Data Source=c:\mydb.db;Version=3;UseUTF16Encoding=True;
SQLite
With password
Data Source=c:\mydb.db;Version=3;Password=myPassword;
so make sure you wrote the proper connection string for your sql lite db
if you still cannot see it, check if the disk containing /tmp full otherwise , it might be encrypted database, or locked and used by some other application maybe , you may confirm that by using one of the many tools for sql light database ,
you may downliad this tool , try to navigate directly to where your db exist and it will give you indication of the problem .
download windows version
Download Mac Version
Download linux version
good luck
LOAD is a DB2 utility that I would like to use to insert data into a table from a CSV file. How can I do this in Python using the ibm_db driver? I don't see anything in the docs here
CMD: LOAD FROM xyz OF del INSERT INTO FOOBAR
Running this as standard SQL fails as expected:
Transaction couldn't be completed: [IBM][CLI Driver][DB2/LINUXX8664] SQL0104N An unexpected token "LOAD FROM xyz OF del" was found following "BEGIN-OF-STATEMENT". Expected tokens may include: "<space>". SQLSTATE=42601 SQLCODE=-104
Using the db2 CLP directly (i.e. os.system('db2 -f /path/to/script.file')) is not an option as DB2 sits on a different machine that I don't have SSH access to.
EDIT:
Using the ADMIN_CMD utility also doesn't work because the file being loaded cannot be put on the database server due to firewall. For now, I've switched to using INSERT
LOAD is an IBM command line processor command, not an SQL command. Is such, it isn't available through the ibm_db module.
The most typical way to do this would be to load the CSV data into Python (either all the rows or in batches if it is too large for memory) then use a bulk insert to insert many rows at once into the database.
To perform a bulk insert you can use the execute_many method.
You could CALL the ADMIN_CMD procedure. ADMIN_CMD has support for both LOAD and IMPORT. Note that both commands require the loaded/imported file to be on the database server.
The example is taken from the DB2 Knowledge Center:
CALL SYSPROC.ADMIN_CMD('load from staff.del of del replace
keepdictionary into SAMPLE.STAFF statistics use profile
data buffer 8')
CSV to DB2 with Python
Briefly: One solution is to use an SQLAlchemy adapter and Db2’s External Tables.
SQLAlchemy:
The Engine is the starting point for any SQLAlchemy application. It’s “home base” for the actual database and its DBAPI, delivered to the SQLAlchemy application through a connection pool and a Dialect, which describes how to talk to a specific kind of database/DBAPI combination.
Where above, an Engine references both a Dialect and a Pool, which together interpret the DBAPI’s module functions as well as the behavior of the database.
Creating an engine is just a matter of issuing a single call, create_engine():
dialect+driver://username:password#host:port/database
Where dialect is a database name such as mysql, oracle, postgresql, etc., and driver the name of a DBAPI, such as psycopg2, pyodbc, cx_oracle, etc.
Load data by using transient external table:
Transient external tables (TETs) provide a way to define an external table that exists only for the duration of a single query.
TETs have the same capabilities and limitations as normal external tables. A special feature of a TET is that you do not need to define the table schema when you use the TET to load data into a table or when you create the TET as the target of a SELECT statement.
Following is the syntax for a TET:
INSERT INTO <table> SELECT <column_list | *>
FROM EXTERNAL 'filename' [(table_schema_definition)]
[USING (external_table_options)];
CREATE EXTERNAL TABLE 'filename' [USING (external_table_options)]
AS select_statement;
SELECT <column_list | *> FROM EXTERNAL 'filename' (table_schema_definition)
[USING (external_table_options)];
For information about the values that you can specify for the external_table_options variable, see External table options.
General example
Insert data from a transient external table into the database table on the Db2 server by issuing the following command:
INSERT INTO EMPLOYEE SELECT * FROM external '/tmp/employee.dat' USING (delimiter ',' MAXERRORS 10 SOCKETBUFSIZE 30000 REMOTESOURCE 'JDBC' LOGDIR '/logs' )
Requirements
pip install ibm-db
pip install SQLAlchemy
Pyton code
One example below shows how it works together.
from sqlalchemy import create_engine
usr = "enter_username"
pwd = "enter_password"
hst = "enter_host"
prt = "enter_port"
db = "enter_db_name"
#SQL Alchemy URL
conn_params = "db2+ibm_db://{0}:{1}#{2}:{3}/{4}".format(usr, pwd, hst, prt, db)
shema = "enter_name_restore_shema"
table = "enter_name_restore_table"
destination = "/path/to/csv/file_name.csv"
try:
print("Connecting to DB...")
engine = create_engine(conn_params)
engine.connect() # optional, output: DB2/linux...
print("Successfully Connected!")
except Exception as e:
print("Unable to connect to the server.")
print(str(e))
external = """INSERT INTO {0}.{1} SELECT * FROM EXTERNAL '{2}' USING (CCSID 1208 DELIMITER ',' REMOTESOURCE LZ4 NOLOG TRUE )""".format(
shema, table, destination
)
try:
print("Restoring data to the server...")
engine.execute(external)
print("Data restored successfully.")
except Exception as e:
print("Unable to restore.")
print(str(e))
Conclusion
A great solution for restoredlarge files, specifically, 600m worked without any problems.
It is also useful for copying data from one table/database to another table. So that the backup is done as an export of csv and then that csv into DB2 with the given example.
SQLAlchemy-Engine can be combined with other databases such as: sqlite, mysql, postgresql, oracle, mssql, etc.