I need to insert a CSV file into a table on SQL Server using Python (BULK INSERT is turned off). Instead of using SQLAlchemy I'm writing my own function (may God forgive me). I'm creating lists of SQL code as strings
sql_code_list = ["insert into table_name values (1,'aa'),(2,'ab'),(3,'ac')...(100,'az')",
"insert into table_name values (101,'ba'),(102,'bb'),(103,'bc')...(200,'bz')"]
and I plan to run them in the DB using the pyodbc package, one by one. To ensure data integrity, I want to use the BEGIN TRAN ... ROLLBACK / COMMIT TRAN syntax. So I want to send the command
DECLARE @TransactionName varchar(20) = 'TransInsert'
BEGIN TRAN @TransactionName
then send all my INSERT statements and, on success, send
DECLARE @TransactionName varchar(20) = 'TransInsert'
COMMIT TRAN @TransactionName
or on failure
DECLARE @TransactionName varchar(20) = 'TransInsert'
ROLLBACK TRAN @TransactionName
There will be many INSERT statements, let's say 10,000 statements each inserting 100 rows, and I plan to send them from the same connection.cursor object but in multiple batches. Does this overall look like a correct procedure? What problems may I run into when I send these commands from a Python application?
There is no need for a named transaction here.
You could submit a transactional batch of multiple statements like this to conditionally rollback and throw on error:
SET XACT_ABORT, NOCOUNT ON;
BEGIN TRY
    BEGIN TRAN;
    <insert-statements-here>;
    COMMIT;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK;
    THROW;
END CATCH;
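For reference, a minimal sketch of sending such a batch from Python with pyodbc; the connection string is a placeholder, and sql_code_list is assumed to be the list of INSERT strings built in the question:

import pyodbc

# Hypothetical connection string -- adjust driver/server/credentials for your environment.
conn_str = ("DRIVER={ODBC Driver 17 for SQL Server};"
            "SERVER=myserver;DATABASE=mydb;UID=user;PWD=pass")

# Concatenate the generated INSERT statements into one transactional batch.
inserts = ";\n    ".join(sql_code_list)
batch = f"""
SET XACT_ABORT, NOCOUNT ON;
BEGIN TRY
    BEGIN TRAN;
    {inserts};
    COMMIT;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0 ROLLBACK;
    THROW;
END CATCH;
"""

conn = pyodbc.connect(conn_str, autocommit=True)
try:
    # autocommit=True so pyodbc does not start its own transaction;
    # the explicit BEGIN TRAN / COMMIT / ROLLBACK inside the batch is in charge.
    conn.execute(batch)
finally:
    conn.close()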
The maximum SQL Server batch size is 64K * (network packet size), and the default packet size is 4K, so each batch may be up to 256MB by default. 10K inserts will likely fit within that limit, so you could try sending them all in a single batch and break it into multiple smaller batches only if needed.
An alternative method to insert multiple rows is an INSERT...SELECT from a table-valued parameter source. See this answer for an example of passing a TVP value. I would expect much better performance with that technique because it avoids parsing a large batch, and SQL Server internally bulk-inserts TVP data into tempdb.
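A rough, hedged sketch of the TVP approach (all object names here are made up, and the exact parameter-passing convention may vary with your pyodbc and ODBC driver versions; pyodbc added TVP support in 4.0.25):

import pyodbc

conn = pyodbc.connect(conn_str, autocommit=True)  # conn_str as above (hypothetical)
cur = conn.cursor()

# One-time setup: a table type and a procedure doing INSERT...SELECT from it.
cur.execute("CREATE TYPE dbo.TableNameRows AS TABLE (id int, val varchar(10));")
cur.execute("""
CREATE PROCEDURE dbo.usp_InsertTableName @rows dbo.TableNameRows READONLY
AS
INSERT INTO table_name (id, val)
SELECT id, val FROM @rows;
""")

# Pass all rows in one call as a list of tuples bound to the TVP parameter.
rows = [(1, 'aa'), (2, 'ab'), (3, 'ac')]
cur.execute("{CALL dbo.usp_InsertTableName (?)}", (rows,))
conn.close()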
Related
I am trying to insert data contained in a .csv file from my pc to a remote server. The values are inserted in a table that contains 3 columns, namely Timestamp, Value and TimeseriesID. I have to insert approximately 3000 rows at a time, therefore I am currently using pyodbc and executemany.
My code up to now is the one shown below:
with contextlib.closing(pyodbc.connect(connection_string, autocommit=True)) as conn:
    with contextlib.closing(conn.cursor()) as cursor:
        cursor.fast_executemany = True  # new in pyodbc 4.0.19
        # Insert values into the DataTable table
        insert_df = df[["Time (UTC)", column]]
        insert_df["id"] = timeseriesID
        insert_df = insert_df[["id", "Time (UTC)", column]]
        sql = "INSERT INTO %s (%s, %s, %s) VALUES (?, ?, ?)" % (
            sqltbl_datatable, 'TimeseriesId', 'DateTime', 'Value')
        params = [i.tolist() for i in insert_df.values]
        cursor.executemany(sql, params)
As I am using pyodbc 4.0.19, I have the fast_executemany option set to True, which is supposed to speed things up. However, for some reason, I do not see any great improvement when I enable it. Is there an alternative way that I could use to speed up the insertion of my file?
Moreover, regarding the performance of the code shown above, I noticed that when I disable the autocommit=True option and instead call cursor.commit() at the end of my code, the data is imported significantly faster. Is there any specific reason why this happens that I am not aware of?
Any help would be greatly appreciated :)
Regarding the cursor.commit() speed-up that you are noticing: with autocommit=True you are asking the code to execute one database transaction per insert. This means the code resumes only after the database confirms the data is stored on disk. When you call cursor.commit() after the numerous INSERTs, you are effectively executing one database transaction, and the data sits in RAM in the interim (it may be written to disk, but not all of it at the moment you instruct the database to finalize the transaction).
The process of finalizing a transaction typically entails updating tables on disk, updating indexes, flushing logs, syncing copies, etc., which is costly. That is why you observe such a speed-up between the two scenarios you describe.
When going the faster way, please note that until you execute cursor.commit() you cannot be 100% sure that the data is in the database, so you may need to reissue the query in case of an error (any partial transaction is going to be rolled back).
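A minimal sketch of that faster pattern, reusing the names from the question's code and adding an explicit rollback as an assumption:

import contextlib
import pyodbc

with contextlib.closing(pyodbc.connect(connection_string, autocommit=False)) as conn:
    with contextlib.closing(conn.cursor()) as cursor:
        cursor.fast_executemany = True
        try:
            cursor.executemany(sql, params)  # sql and params built as in the question
            conn.commit()                    # one transaction for all rows
        except Exception:
            conn.rollback()                  # nothing is persisted on failure
            raise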
I am using the psycopg2 module in Python to read from a Postgres database, and I need to do some operation on all rows of a column in a table that has more than 1 million rows.
I would like to know: would cur.fetchall() fail or cause my server to go down? (My RAM might not be big enough to hold all that data.)
q="SELECT names from myTable;"
cur.execute(q)
rows=cur.fetchall()
for row in rows:
    doSomething(row)
What is the smarter way to do this?
The solution Burhan pointed out reduces the memory usage for large datasets by only fetching single rows:
row = cursor.fetchone()
However, I noticed a significant slowdown in fetching rows one by one. I access an external database over an internet connection; that might be a reason for it.
Having a server-side cursor and fetching rows in batches proved to be the most performant solution. You can change the SQL statements (as in alecxe's answer), but there is also a pure Python approach using the feature provided by psycopg2:
cursor = conn.cursor('name_of_the_new_server_side_cursor')
cursor.execute(""" SELECT * FROM table LIMIT 1000000 """)
while True:
    rows = cursor.fetchmany(5000)
    if not rows:
        break
    for row in rows:
        # do something with row
        pass
You can find more about server-side cursors in the psycopg2 wiki.
Consider using a server-side cursor:
When a database query is executed, the Psycopg cursor usually fetches
all the records returned by the backend, transferring them to the
client process. If the query returned a huge amount of data, a
proportionally large amount of memory will be allocated by the client.
If the dataset is too large to be practically handled on the client
side, it is possible to create a server side cursor. Using this kind
of cursor it is possible to transfer to the client only a controlled
amount of data, so that a large dataset can be examined without
keeping it entirely in memory.
Here's an example:
cursor.execute("DECLARE super_cursor BINARY CURSOR FOR SELECT names FROM myTable")
while True:
cursor.execute("FETCH 1000 FROM super_cursor")
rows = cursor.fetchall()
if not rows:
break
for row in rows:
doSomething(row)
fetchall() creates a Python object for every remaining row in one go (it is fetchmany() that honors the arraysize limit), so to prevent a massive memory hit you can either fetch rows in manageable batches or simply step through the cursor till it's exhausted:
row = cur.fetchone()
while row:
    # do something with row
    row = cur.fetchone()
Here is the code to use for a simple server-side cursor with the speed of fetchmany-style batching.
The principle is to use a named cursor in psycopg2 and give it a good itersize, so it loads many rows at once like fetchmany would, but with a single for rec in cursor loop that does an implicit fetchmany().
With this code I make queries over 150 million rows from a multi-billion-row table within 1 hour and about 200 MB of RAM.
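A sketch of that pattern follows; the cursor name and itersize value are illustrative, not prescriptive:

import psycopg2

conn = psycopg2.connect(dsn)  # dsn is your connection string
with conn.cursor(name='my_named_cursor') as cursor:  # a named cursor is server-side
    cursor.itersize = 20000  # rows fetched from the server per round trip
    cursor.execute("SELECT names FROM myTable")
    for rec in cursor:  # iterating does an implicit fetchmany(itersize)
        doSomething(rec)
conn.close()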
EDIT: using fetchmany (along with fetchone() and fetchall()), even with a row limit (arraysize), will still send the entire result set, keeping it client-side (stored in the underlying C library, I think libpq) for any additional fetchmany() calls, etc. Without using a named cursor (which would require an open transaction), you have to resort to using LIMIT in the SQL with an ORDER BY, then analyzing the results and augmenting the next query with WHERE (ordered_val = %(last_seen_val)s AND primary_key > %(last_seen_pk)s) OR ordered_val > %(last_seen_val)s.
This is misleading of the library, to say the least; there should be a blurb in the documentation about it. I don't know why it's not there.
I'm not sure a named cursor is a good fit without a need to scroll forward/backward interactively, but I could be wrong here.
The fetchmany loop is tedious but I think it's the best solution here. To make life easier, you can use the following:
from functools import partial
from itertools import chain

# from_iterable added >= python 2.7
from_iterable = chain.from_iterable

# util function
def run_and_iterate(curs, sql, parms=None, chunksize=1000):
    if parms is None:
        curs.execute(sql)
    else:
        curs.execute(sql, parms)
    chunks_until_empty = iter(partial(curs.fetchmany, chunksize), [])
    return from_iterable(chunks_until_empty)

# example scenario
for row in run_and_iterate(cur, 'select * from waffles_table where num_waffles > %s', (10,)):
    print('lots of waffles: %s' % (row,))
As I was reading comments and answers I thought I should clarify something about fetchone and Server-side cursors for future readers.
With normal (client-side) cursors, psycopg2 fetches all the records returned by the backend and transfers them to the client process. All of the records are buffered in the client's memory. This happens when you execute a query like curs.execute('SELECT * FROM ...').
This question also confirms that.
All the fetch* methods are there for accessing this stored data.
Q: So how can fetchone help us memory-wise?
A: It fetches only one record from the stored data and creates a single Python object to hand to your Python code, while fetchall will fetch and create n Python objects from this data and hand them all to you in one chunk.
So if your table has 1,000,000 records, this is what's going on in memory:
curs.execute --> whole 1,000,000 result set + fetchone --> 1 Python object
curs.execute --> whole 1,000,000 result set + fetchall --> 1,000,000 Python objects
Of course fetchone helps, but we still have all the records in memory. This is where server-side cursors come into play:
PostgreSQL also has its own concept of cursor (sometimes also called
portal). When a database cursor is created, the query is not
necessarily completely processed: the server might be able to produce
results only as they are needed. Only the results requested are
transmitted to the client: if the query result is very large but the
client only needs the first few records it is possible to transmit
only them.
...
their interface is the same, but behind the scene they
send commands to control the state of the cursor on the server (for
instance when fetching new records or when moving using scroll()).
So you won't get the whole result set in one chunk.
The drawback:
The downside is that the server needs to keep track of the partially
processed results, so it uses more memory and resources on the server.
My application is very database-intensive, so I'm trying to reduce the load on the database. I am using PostgreSQL as the RDBMS and Python as the programming language.
To reduce the load I am already using a caching mechanism in the application. The caching types I use are a server cache and a browser cache.
Currently I'm trying to tune the PostgreSQL query cache to get it in line with the characteristics of the queries being run on the server.
Questions:
Is it possible to fine tune query cache on a per database level?
Is it possible to fine tune query cache on a per table basis?
Please point me to a tutorial for learning about query caching in PostgreSQL.
Tuning PostgreSQL is far more than just tuning caches. In fact, the primary high-level settings are "shared buffers" (think of this as the main data and index cache) and work_mem.
The shared buffers help with reading and writing. You want to give this cache a decent size, but it's for the entire cluster, and you can't really tune it on a per-table or per-query basis. Importantly, it's not really storing query results; it's storing tables, indexes and other data. In an ACID-compliant database, it's not very efficient or useful to cache query results.
The "work_mem" setting is used to sort query results in memory rather than resorting to writing to disk. Depending on your query, this area could be as important as the buffer cache, and it is easier to tune. Before running a query that needs to do a larger sort, you can issue a set command like "SET work_mem = '256MB';"
As others have suggested you can figure out WHY a query is running slowly using "explain". I'd personally suggest learning the "access path" postgresql is taking to get to your data. That's far more involved and honestly a better use of resources than simply thinking of "caching results".
You can honestly improve things a lot with data design as well and using features such as partitioning, functional indexes, and other techniques.
One other thing is that you can get better performance by writing better queries.. things like "with" clauses can prevent postgres' optimizer from optimizing queries fully.
The optimizer itself also has parameters that can be adjusted-- so that the DB will spend more (or less) time optimizing a query prior to executing it.. which can make a difference.
You can also use certain techniques to write queries to help the optimizer. One such technique is to use bind variables (colon variables)--- this will result in the optimizer getting the same query over and over with different data passed in. This way, the structure doesn't have to be evaluated over and over.. query plans can be cached in this way.
Without seeing some of your queries, your table and index designs, and an explain plan, it's hard to make specific recommendation.
In general, you need to find queries that aren't as performant as you feel they should be and figure out where the contention is. Likely it's disk access; however, the cause is ultimately the most important part. Is it having to go to disk to do the sort? Is it internally choosing a bad path to the data, such that it's reading rows that could easily have been eliminated earlier in the query process?
I've been an Oracle-certified DBA for over 20 years, and PostgreSQL is definitely different, but many of the same techniques are used when diagnosing a query's performance issues. Although you can't really provide hints, you can still rewrite queries or tune certain parameters to get better performance; in general, I've found PostgreSQL to be easier to tune in the long run. If you can provide some specifics, perhaps a query and its explain info, I'd be happy to give you specific recommendations. Sadly, though, "cache tuning" alone is unlikely to provide the speed you're wanting.
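As a hedged illustration of the EXPLAIN and work_mem suggestions above (the table and query are made up), from psycopg2 you might do something like this:

import psycopg2

conn = psycopg2.connect(dsn)  # dsn is your connection string
cur = conn.cursor()

# Give this session more sort memory before a query known to do a big sort.
cur.execute("SET work_mem = '256MB';")

# Ask the planner which access path it takes and where the time actually goes.
cur.execute("EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM my_big_table ORDER BY some_col;")
for line in cur.fetchall():
    print(line[0])  # each row of the result is one line of the plan text

conn.close()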
I developed a system for caching results to speed up queries issued from a web-based solution. Below I reproduce, in essence, what it did:
The following are the generic caching handling tables and functions.
CREATE TABLE cached_results_headers (
    cache_id serial NOT NULL PRIMARY KEY,
    date timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP,
    last_access timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP,
    relid regclass NOT NULL,
    query text NOT NULL,
    rows int NOT NULL DEFAULT 0
);

CREATE INDEX ON cached_results_headers (relid, md5(query));

CREATE TABLE cached_results (
    cache_id int NOT NULL,
    row_no int NOT NULL
);
CREATE OR REPLACE FUNCTION f_get_cached_results_header (p_cache_table text, p_source_relation regclass, p_query text, p_max_lifetime interval, p_clear_old_data interval) RETURNS cached_results_headers AS $BODY$
DECLARE
    _cache_id int;
    _rows int;
BEGIN
    IF p_clear_old_data IS NOT NULL THEN
        DELETE FROM cached_results_headers WHERE date < CURRENT_TIMESTAMP - p_clear_old_data;
    END IF;

    _cache_id := cache_id FROM cached_results_headers WHERE relid = p_source_relation AND md5(query) = md5(p_query) AND query = p_query AND date > CURRENT_TIMESTAMP - p_max_lifetime;

    IF _cache_id IS NULL THEN
        INSERT INTO cached_results_headers (relid, query) VALUES (p_source_relation, p_query) RETURNING cache_id INTO _cache_id;
        EXECUTE $$ INSERT INTO $$||p_cache_table||$$ SELECT $1, row_number() OVER (), r.r FROM ($$||p_query||$$) r $$ USING _cache_id;
        GET DIAGNOSTICS _rows = ROW_COUNT;
        UPDATE cached_results_headers SET rows = _rows WHERE cache_id = _cache_id;
    ELSE
        UPDATE cached_results_headers SET last_access = CURRENT_TIMESTAMP;
    END IF;

    RETURN (SELECT h FROM cached_results_headers h WHERE cache_id = _cache_id);
END;
$BODY$ LANGUAGE PLPGSQL SECURITY DEFINER;
The following is an example of how to use the tables and functions above, for a given view named my_view with a field key to be selected within a range of integer values. You would replace all the following with your particular needs, and replace my_view with either a table, a view, or a function. Also replace the filtering parameters as required.
CREATE VIEW my_view AS SELECT ...; -- create a query with your data, with one of the integer columns in the result as "key" to filter by
CREATE TABLE cached_results_my_view (
    row my_view NOT NULL,
    PRIMARY KEY (cache_id, row_no),
    FOREIGN KEY (cache_id) REFERENCES cached_results_headers ON DELETE CASCADE
) INHERITS (cached_results);
CREATE OR REPLACE FUNCTION f_get_my_view_cached_rows (p_filter1 int, p_filter2 int, p_row_from int, p_row_to int) RETURNS SETOF my_view AS $BODY$
DECLARE
    _cache_id int;
BEGIN
    _cache_id := cache_id
        FROM f_get_cached_results_header('cached_results_my_view', 'my_view'::regclass,
            'SELECT r FROM my_view r WHERE key BETWEEN '||p_filter1::text||' AND '||p_filter2::text||' ORDER BY key',
            '15 minutes'::interval, '1 day'::interval); -- cache for 15 minutes max since creation time; delete all cached data older than 1 day old

    RETURN QUERY
        SELECT (row).*
        FROM cached_results_my_view
        WHERE cache_id = _cache_id AND row_no BETWEEN p_row_from AND p_row_to
        ORDER BY row_no;
END;
$BODY$ LANGUAGE PLPGSQL;
Example: retrieve rows 1 to 2000 from the cached my_view results filtered by key BETWEEN 30044 AND 10610679. Run it a first time and the results of the query will be cached into the table cached_results_my_view, and the first 2000 records will be returned. Run it again shortly afterwards and the results will be retrieved from the table cached_results_my_view directly, without executing the query.
SELECT * FROM f_get_my_view_cached_rows(30044, 10610679, 1, 2000);
I have a table with three columns: id, word, essay. I want to do a query using (?). The SQL statement is sql1 = "select id,? from training_data". My code is below:
def dbConnect(db_name, sql, flag):
    conn = sqlite3.connect(db_name)
    cursor = conn.cursor()
    if flag == "danci":
        itm = 'word'
    elif flag == "wenzhang":
        itm = 'essay'
    n = cursor.execute(sql, (itm,))
    res1 = cursor.fetchall()
    return res1
However, when I print dbConnect("data.db", sql1, "danci"), the result I obtain is [(1, 'word'), (2, 'word'), (3, 'word'), ...]. What I really want to get is [(1, 'the content of word column'), (2, 'the content of word column'), ...]. What should I do? Please give me some ideas.
You can't use placeholders for identifiers -- only for literal values.
I don't know what to suggest in this case, as your function takes a database name, an SQL string, and a flag to say how to modify that string. I think it would be better to pass just the first two, and write something like
sql = {
    "danci": "SELECT id, word FROM training_data",
    "wenzhang": "SELECT id, essay FROM training_data",
}
and then call it with one of
dbConnect("data.db", sql['danci'])
or
dbConnect("data.db", sql['wenzhang'])
But a lot depends on why you are asking dbConnect to decide on the columns to fetch based on a string passed in from outside; it's an unusual design.
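For illustration, a minimal sketch of how the function and call site might look with that dict, keeping the question's sqlite3 setup (the dict name is arbitrary):

import sqlite3

SQL_BY_FLAG = {
    "danci": "SELECT id, word FROM training_data",
    "wenzhang": "SELECT id, essay FROM training_data",
}

def dbConnect(db_name, sql):
    conn = sqlite3.connect(db_name)
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        return cursor.fetchall()
    finally:
        conn.close()

rows = dbConnect("data.db", SQL_BY_FLAG["danci"])  # [(1, 'the content of word column'), ...]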
Update - SQL Injection
The problems with SQL injection and tainted data are well documented, but here is a summary.
The principle is that, in theory, a programmer can write safe and secure programs as long as all the sources of data are under his control. As soon as they use any information from outside the program without checking its integrity, security is under threat.
Such information ranges from the obvious -- the parameters passed on the command line -- to the obscure -- if the PATH environment variable is modifiable then someone could induce a program to execute a completely different file from the intended one.
Perl provides direct help to avoid such situations with Taint Checking, but SQL Injection is the open door that is relevant here.
Suppose you take the value for a database column from an unverified external source, and that value appears in your program as $val. Then, if you write
my $sql = "INSERT INTO logs (date) VALUES ('$val')";
$dbh->do($sql);
then it looks like it's going to be okay. For instance, if $val is set to 2014-10-27 then $sql becomes
INSERT INTO logs (date) VALUES ('2014-10-27')
and everything's fine. But now suppose that our data is being provided by someone less than scrupulous or downright malicious, and your $val, having originated elsewhere, contains this
2014-10-27'); DROP TABLE logs; SELECT COUNT(*) FROM security WHERE name != '
Now it doesn't look so good. $sql is set to this (with added newlines)
INSERT INTO logs (date) VALUES ('2014-10-27');
DROP TABLE logs;
SELECT COUNT(*) FROM security WHERE name != '')
which adds an entry to the logs table as before, and then goes ahead and drops the entire logs table and counts the number of records in the security table. That isn't what we had in mind at all, and something we must guard against.
The immediate solution is to use placeholders ? in a prepared statement, passing the actual values later in a call to execute. This not only speeds things up, because the SQL statement can be prepared (compiled) just once, but also protects the database from malicious data by quoting every supplied value appropriately for its data type and escaping any embedded quotes, so that it is impossible to close one statement and open another.
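Translated to the Python sqlite3 context of the question (a rough sketch; the logs table and its date column are just for illustration), the placeholder form looks like this:

import sqlite3

conn = sqlite3.connect("data.db")
cur = conn.cursor()

val = "2014-10-27'); DROP TABLE logs; --"  # hostile input is bound as a plain value
cur.execute("INSERT INTO logs (date) VALUES (?)", (val,))  # never interpolated into the SQL
conn.commit()
conn.close()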
This whole concept was humorously captured in Randall Munroe's excellent XKCD comic.
I am using pyodbc to retrieve data from a Microsoft SQL Server. The query is of the following form
SET NOCOUNT ON --Ignore count statements
CREATE TABLE mytable ( ... )
EXEC some_stored_procedure
INSERT mytable
--Perform some processing...
SELECT *
FROM mytable
The stored procedure performs some aggregation over values that contain NULLs such that warnings of the form Warning: Null value is eliminated by an aggregate or other SET operation. are issued. This results in pyodbc failing to retrieve data with the error message No results. Previous SQL was not a query.
I have tried to disable the warnings by setting SET ANSI_WARNINGS OFF. However, the query then fails with the error message Heterogeneous queries require the ANSI_NULLS and ANSI_WARNINGS options to be set for the connection. This ensures consistent query semantics. Enable these options and then reissue your query..
Is it possible to
disable the warnings
or have pyodbc ignore the warnings?
Note that I do not have permissions to change the stored procedure.
Store the results of the query in a temporary table and execute the statement as two queries:
with pyodbc.connect(connection_string) as connection:
    connection.execute(query1)           # Do the work
    result = connection.execute(query2)  # Select the data
    data = result.fetchall()             # Retrieve the data
The first query does the heavy lifting and is of the form
--Do some work and execute complicated queries that issue warning messages
--Store the results in a temporary table
SELECT some, column, names
INTO #datastore
FROM some_table
The second query retrieves the data and is of the form
SELECT * FROM #datastore
Thus, all warning messages are issued upon execution of the first query. They do not interfere with data retrieval during the execution of the second query.
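Putting the two pieces together as a hedged sketch (SET NOCOUNT ON, the temp table, and the column names mirror the question and the snippets above and are purely illustrative):

import pyodbc

query1 = """
SET NOCOUNT ON;
SELECT some, column, names
INTO #datastore
FROM some_table;  -- any aggregate warnings are raised here and ignored
"""

query2 = "SELECT * FROM #datastore;"

with pyodbc.connect(connection_string) as connection:
    connection.execute(query1)           # heavy lifting; warnings do not interfere
    result = connection.execute(query2)  # a plain SELECT, so pyodbc sees a result set
    data = result.fetchall()

Because #datastore is a session-scoped temporary table, both statements must run on the same connection, which they do here.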
I have had some luck mitigating this error by flipping ansi_warnings on and off just around the offending view or stored proc.
/* vw_someView aggregates away some nulls and presents warnings that blow up pyodbc */
set ANSI_WARNINGS off
select *
into #my_temp
from vw_someView
set ANSI_WARNINGS on
/* rest of query follows */
This assumes that the entity that produces the aggregate warning doesn't also require warnings to be turned on. If it complains, it probably means that the entity itself has a portion of code like this that requires a toggle of the ansi_warnings (or a rewrite to eliminate the aggregation.)
One caveat is that I've found this toggle still returns the "heterogeneous" warning if I try to run it as a cross-server query. Also, while debugging, it's pretty easy to get into a state where ANSI_WARNINGS is flipped off without you realizing it, and you start getting heterogeneous errors for seemingly no reason. Just run the "set ANSI_WARNINGS on" line by itself to get yourself back into a good state.
The best thing is to add a try/except block:
sql="sp_help stored_procedure;"
print(">>>>>executing {}".format(sql))
next_cursor=cursor.execute(sql)
while next_cursor:
try:
row = cursor.fetchone()
while row:
print(row)
row = cursor.fetchone()
except Exception as my_ex:
print("stored procedure returning non-row {}".format(my_ex))
next_cursor=cursor.nextset()