I am querying a Postgres database for a large number of results and want to use server-side cursors to stream the results to my client. When I do this, the cursor's rowcount attribute is set to -1 after I execute the query. I'm creating the cursor like so:
with db.cursor('cursor_name') as cursor:
Is there a way to find the number of results of my query while streaming results from the database? (I could do a SELECT COUNT(*), but I'd like to avoid that because I'm trying to abstract away the code around the query and that would complicate the API).
In the case of a server-side cursor, although cursor.execute() returns, the query has not necessarily been executed by the server at that point, and so the row count is not available to psycopg2. This is consistent with the DBAPI 2.0 spec which states that rowcount should be -1 if the row count of the last operation is indeterminate.
Attempting to coerce it with cursor.fetchone(), for example, updates cursor.rowcount, but only by the number of items retrieved so far, so that is not useful. cursor.fetchall() will result in rowcount being set correctly, but it performs the full query and data transfer that you are trying to avoid.
A possible workaround that avoids a completely separate query to get the count, and which should give accurate results is:
select *, (select count(*) from test) from test;
This will result in each row having the table row count appended as the final column. You can then get the table row count using cursor.fetchone() and then taking the final column:
with db.cursor('cursor_name') as cursor:
    cursor.execute('select *, (select count(*) from test) from test')
    row = cursor.fetchone()
    data, count = row[:-1], row[-1]
Now count will contain the number of rows in the table. You can use row[:-1] to refer to the row data.
This might slow down the query because a possibly expensive SELECT COUNT(*) will be performed, but once that is done, retrieving the data should be fast.
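If you want to keep this detail out of your API, one option is to wrap it in a small generator that peels the count off each row. This is only a sketch: stream_with_count is a made-up helper, not part of psycopg2, and it assumes the query already has the count appended as its last column.

def stream_with_count(db, query_with_count, cursor_name='stream_cursor'):
    """Yield (total_count, row_data) pairs, stripping the appended count column."""
    with db.cursor(cursor_name) as cursor:  # named cursor = server-side
        cursor.execute(query_with_count)
        for row in cursor:
            yield row[-1], row[:-1]

for count, data in stream_with_count(db, 'select *, (select count(*) from test) from test'):
    ...  # count is the same on every row; data is the original row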
Related
I'm using a server-side cursor in PostgreSQL with psycopg2, based on this well-explained answer.
with conn.cursor(name='name_of_cursor') as cursor:
    query = "SELECT * FROM tbl FOR UPDATE"
    cursor.execute(query)
    for row in cursor:
        # process row
In processing each row, I'd like to update a few fields in the row using PostgreSQL's UPDATE tbl SET ... WHERE CURRENT OF name_of_cursor (docs), but it seems that, when the for loop enters and row is set, the server-side cursor is positioned at a different record, so while I can run the command, the wrong record is updated.
How can I make sure the result iterator is in the same position as the cursor? (also preferably in a way that won't make the loop slower than updating using an ID)
The reason a different record was being updated is that internally psycopg2 does a FETCH FORWARD of itersize rows (2000 by default), positioning the server-side cursor at the end of that block. You can override this by fetching one record at a time:
updcursor = conn.cursor()
with conn.cursor(name='name_of_cursor') as cursor:
    cursor.itersize = 1  # keep the server-side cursor at the same position as the iterator
    cursor.execute('SELECT * FROM tbl FOR UPDATE')
    for row in cursor:
        # process row...
        updcursor.execute('UPDATE tbl SET fld1 = %s WHERE CURRENT OF name_of_cursor', [val])
The snippet above will update the correct record. Note that you cannot use the same cursor for selecting and updating; they must be separate cursors.
Performance
Reducing the FETCH size to 1 hurts retrieval performance considerably. I definitely wouldn't recommend this technique if you're iterating over a large dataset (which is probably why you're using server-side cursors in the first place) from a different host than the PostgreSQL server.
I ended up using a combination of exporting records to CSV and then importing them later using COPY FROM (with the copy_expert function).
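Roughly, that round trip looks like this. A sketch only, assuming psycopg2's copy_expert(); the table name tbl and the file name are placeholders:

# export: stream the table out as CSV
with conn.cursor() as cur, open('tbl_dump.csv', 'w') as f:
    cur.copy_expert("COPY tbl TO STDOUT WITH CSV HEADER", f)

# ... process/transform the CSV offline ...

# import: load the (modified) CSV back in
with conn.cursor() as cur, open('tbl_dump.csv') as f:
    cur.copy_expert("COPY tbl FROM STDIN WITH CSV HEADER", f)
conn.commit()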
I'm trying to delete all the entries from a table but am not able to do it. It does not matter whether I use the TRUNCATE or DELETE keyword; the same error occurs.
import pyodbc

conn = pyodbc.connect(
    r'Driver={SQL Server};'
    r'Server=' + ip + r'\SQLEXPRESS;'
    r'Database=...;'
    r'UID=...;'
    r'PWD=...;', timeout=5)
cursor = conn.cursor()
data = cursor.execute("TRUNCATE TABLE table_name")
pyodbc.ProgrammingError: No results. Previous SQL was not a query.
Setting autocommit to True does not work. Parametrizing the statement also does not work. The connection is fine, because a SELECT works and returns the right value. Truncating and deleting do not work at all; the database is still intact.
When executing from PyCharm's Python Console, I get the following error whenever I try to access the data object (e.g. print(data.fetchval())):
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
pyodbc.ProgrammingError: No results. Previous SQL was not a query.
I've read before that it might have to do with how the database table is indexed and its primary key, but I'm not able to explain it.
I was hoping to get the number of rows affected.
When we execute a single SQL statement via Cursor.execute, the server can return one of three things:
zero or more rows of data in a result set (for a SELECT statement), or
an integer row count (for DML statements like UPDATE, DELETE, etc.), or
an error.
We retrieve information from a result set via the pyodbc methods .fetchall(), .fetchone(), .fetchval(), etc. We retrieve row counts using the cursor's rowcount attribute.
crsr = cnxn.cursor()
crsr.execute("DROP TABLE IF EXISTS so64124053")
crsr.execute("CREATE TABLE so64124053 (id int primary key, txt varchar(10))")
crsr.execute("INSERT INTO so64124053 (id, txt) VALUES (1, 'foo')")
print(crsr.rowcount) # 1
print(crsr.execute("SELECT COUNT(*) AS n FROM so64124053").fetchval()) # 1
crsr.execute("INSERT INTO so64124053 (id, txt) VALUES (2, 'bar')")
print(crsr.rowcount) # 1
print(crsr.execute("SELECT COUNT(*) AS n FROM so64124053").fetchval()) # 2
Note that TRUNCATE is a special case because it doesn't bother counting the rows it removes from the table; it just returns a row count of -1 …
crsr.execute("TRUNCATE TABLE so64124053")
print(crsr.rowcount) # -1
… however the rows are indeed removed
print(crsr.execute("SELECT COUNT(*) AS n FROM so64124053").fetchval()) # 0
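If what you actually need is the number of rows removed, DELETE does report it via rowcount. A quick sketch continuing the example table above:

crsr.execute("INSERT INTO so64124053 (id, txt) VALUES (1, 'foo')")
crsr.execute("DELETE FROM so64124053")
print(crsr.rowcount)  # 1, unlike TRUNCATE's -1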
I have an Oracle DB with over 5 million rows with columns of type varchar and blob. To connect to the database and read the records I use Python 3.6 with a JDBC driver and the JayDeBeApi library. What I am trying to achieve is to read each row, perform some operations on the record (apply a regex, for example) and then store the new record values in a new table. I don't want to load all records into memory, so I want to fetch them from the database in batches, store the fetched data, process it and then add it to the other table.
Currently I fetch all the records at once instead of, for example, the first 1000, then the next 1000 and so on. This is what I have so far:
statement = "... a select statement..."
connection = dbDriver.connect(jclassname, [driver_url, username, password], jars)
cursor = connection.cursor()
cursor.execute(statement)
fetched = cursor.fetchall()
for result in fetched:
    preprocess(result)
cursor.close()
How could I modify my code to fetch in batches, and where should I put the second statement that inserts the new values into the other table?
As you said, fetchall() is a bad idea in this case, as it loads all the data into memory.
To avoid that, you can iterate over the cursor object itself:
cur.execute("SELECT * FROM test")
for row in cur: # iterate over result set row by row
do_stuff_with_row(row)
cur.close()
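If you specifically want the 1000-rows-at-a-time behaviour, the standard DBAPI fetchmany() loop also works, and it gives you a natural place for the inserts. A sketch under assumptions: JayDeBeApi's cursor follows the DBAPI here, preprocess() returns a tuple matching the target columns, and new_table with its columns is a placeholder for your real schema:

BATCH_SIZE = 1000

read_cur = connection.cursor()
write_cur = connection.cursor()  # separate cursor for the inserts
read_cur.execute(statement)

while True:
    rows = read_cur.fetchmany(BATCH_SIZE)  # pull only the next chunk
    if not rows:
        break
    processed = [preprocess(row) for row in rows]
    # hypothetical target table/columns; adjust to your schema
    write_cur.executemany("INSERT INTO new_table (col1, col2) VALUES (?, ?)", processed)

read_cur.close()
write_cur.close()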
Does anyone know how to get the row count from a SQLAlchemy query ResultProxy object without looping through the result set? The ResultProxy.rowcount attribute shows 0, but I would expect it to have a value of 2. For updates it shows the number of rows affected, which is what I would expect.
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

engine = create_engine(
    'oracle+cx_oracle://user:pass@host:port/database'
)
session = sessionmaker(
    bind=engine,
    autocommit=False,
    autoflush=False,
)()

sql_text = u"""
SELECT 1 AS Val FROM dual UNION ALL
SELECT 2 AS Val FROM dual
"""
results = session.execute(sql_text)

print '%s rows returned by query...\n' % results.rowcount
print results.keys()
for i in results:
    print repr(i)
Output:
0 rows returned by query...
[u'val']
(1,)
(2,)
resultproxy.rowcount is ultimately a proxy for the DBAPI attribute cursor.rowcount. Most DBAPIs do not provide the "count of rows" for a SELECT query via this attribute; its primary purpose is to provide the number of rows matched by an UPDATE or DELETE statement. A relational database in fact does not know how many rows would be returned by a particular statement until it has finished locating all of those rows; many DBAPI implementations will begin returning rows as the database finds them, without buffering, so no such count is even available in those cases.
To get the count of rows a SELECT query would return, you either need to do a SELECT COUNT(*) up front, or you need to fetch all the rows into an array and perform len() on the array.
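Applied to the example above, the fetchall()/len() option looks like this (note that it buffers the entire result set client-side):

rows = results.fetchall()  # pulls everything into memory

print '%s rows returned by query...\n' % len(rows)
for row in rows:
    print repr(row)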
The notes at ResultProxy.rowcount discuss this further (http://docs.sqlalchemy.org/en/latest/core/connections.html?highlight=rowcount#sqlalchemy.engine.ResultProxy.rowcount):
Notes regarding ResultProxy.rowcount:
This attribute returns the number of rows matched, which is not necessarily the same as the number of rows that were actually modified: an UPDATE statement, for example, may have no net change on a given row if the SET values given are the same as those already present in the row. Such a row would be matched but not modified. On backends that feature both styles, such as MySQL, rowcount is configured by default to return the match count in all cases.

ResultProxy.rowcount is only useful in conjunction with an UPDATE or DELETE statement. Contrary to what the Python DBAPI says, it does not return the number of rows available from the results of a SELECT statement, as DBAPIs cannot support this functionality when rows are unbuffered.

ResultProxy.rowcount may not be fully implemented by all dialects. In particular, most DBAPIs do not support an aggregate rowcount result from an executemany call. The ResultProxy.supports_sane_rowcount() and ResultProxy.supports_sane_multi_rowcount() methods will report from the dialect if each usage is known to be supported.

Statements that use RETURNING may not return a correct rowcount.
You could use this:
rowcount = len(results._saved_cursor._result.rows)
Then your code will be:
print '%s rows returned by query...\n' % rowcount
print results.keys()
I have only tested this with 'find' queries, but it works for me.
I have an sqlite table with a few hundred million rows:
sqlite> create table t1(id INTEGER PRIMARY KEY, stuff TEXT);
I need to query this table by its integer primary key hundreds of millions of times. My code:
import sqlite3

conn = sqlite3.connect('stuff.db')
with conn:
    cur = conn.cursor()
    for id in ids:
        try:
            cur.execute("select stuff from t1 where rowid=?", [id])
            stuff_tuple = cur.fetchone()
            # do something with the fetched row
        except:
            pass  # for when id is not in t1's key set
Here, ids is a list that may have tens of thousands of elements. Forming t1 did not take very long (i.e. ~75K inserts per second). Querying t1 the way I've done it is unacceptably slow (i.e. ~1K queries in 10 seconds).
I am completely new to SQL. What am I doing wrong?
Since you're retrieving values by their keys, it seems like a key/value store would be more appropriate in this case. Relational databases (Sqlite included) are definitely feature-rich, but you can't beat the performance of a simple key/value store.
There are several to choose from:
Redis: "advanced key-value store", very fast, optimized for in-memory operation
Cassandra: extremely high performance, scalable, used by multiple high-profile sites
MongoDB: feature-rich, tries to be "middle ground" between relational and NoSQL (and they've started offering free online classes)
And there's many, many more.
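For a flavour of the key/value access pattern, here is a minimal sketch using the redis-py client against a local Redis server; the key names are made up for illustration:

import redis

r = redis.Redis(host='localhost', port=6379)
r.set('stuff:42', 'some stuff')  # store a value under a key derived from the id
print(r.get('stuff:42'))         # b'some stuff' -- one fast key lookup, no SQL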
You should make one SQL call instead; it should be much faster:
conn = sqlite3.connect('stuff.db')
with conn:
    cur = conn.cursor()
    query = "SELECT stuff FROM t1 WHERE rowid IN (%s)" % ','.join('?' * len(ids))
    for row in cur.execute(query, ids):
        # do something with the fetched row
        pass
You do not need a try/except, since ids that are not in the db simply will not show up in the results. If you want to know which ids were not found, you can do:
found_ids = set()
# same query as above, extended to return rowid so we can see which ids matched
for row in cur.execute("SELECT rowid, stuff FROM t1 WHERE rowid IN (%s)" % ','.join('?' * len(ids)), ids):
    found_ids.add(row[0])  # rows are plain tuples unless a row_factory is set
ids_not_found = set(ids) - found_ids
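One caveat with the IN (?,?,...) approach: SQLite caps the number of bound parameters per statement (999 in older builds; see SQLITE_MAX_VARIABLE_NUMBER), so with tens of thousands of ids you may need to chunk the list. A sketch:

CHUNK = 900  # stay safely under SQLite's bound-parameter limit

for start in range(0, len(ids), CHUNK):
    chunk = ids[start:start + CHUNK]
    placeholders = ','.join('?' * len(chunk))
    for row in cur.execute("SELECT rowid, stuff FROM t1 WHERE rowid IN (%s)" % placeholders, chunk):
        pass  # do something with the fetched row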