I just had a discussion today with some coworkers about Python's DB-API fetchone vs fetchmany vs fetchall.
I'm sure the use case for each of these depends on the DB-API implementation I'm using, but in general, what are the use cases for fetchone vs fetchmany vs fetchall?
In other words, are the following equivalent? Or is one of them preferred over the others, and if so, in which situations?
cursor.execute("SELECT id, name FROM `table`")
for i in xrange(cursor.rowcount):
    id, name = cursor.fetchone()
    print id, name
cursor.execute("SELECT id, name FROM `table`")
result = cursor.fetchmany()
while result:
    for id, name in result:
        print id, name
    result = cursor.fetchmany()
cursor.execute("SELECT id, name FROM `table`")
for id, name in cursor.fetchall():
    print id, name
As per the official psycopg2 documentation:
fetchone()
Fetch the next row of a query result set, returning a single tuple, or None when no more data is available:
>>> cur.execute("SELECT * FROM test WHERE id = %s", (3,))
>>> cur.fetchone()
(3, 42, 'bar')
A ProgrammingError is raised if the previous call to execute*() did not produce any result set or no call was issued yet.
fetchmany([size=cursor.arraysize])
Fetch the next set of rows of a query result, returning a list of tuples. An empty list is returned when no more rows are available.
The number of rows to fetch per call is specified by the parameter. If it is not given, the cursor’s arraysize determines the number of rows to be fetched. The method should try to fetch as many rows as indicated by the size parameter. If this is not possible due to the specified number of rows not being available, fewer rows may be returned:
>>> cur.execute("SELECT * FROM test;")
>>> cur.fetchmany(2)
[(1, 100, "abc'def"), (2, None, 'dada')]
>>> cur.fetchmany(2)
[(3, 42, 'bar')]
>>> cur.fetchmany(2)
[]
A ProgrammingError is raised if the previous call to execute*() did not produce any result set or no call was issued yet.
Note there are performance considerations involved with the size parameter. For optimal performance, it is usually best to use the arraysize attribute. If the size parameter is used, then it is best for it to retain the same value from one fetchmany() call to the next.
fetchall()
Fetch all (remaining) rows of a query result, returning them as a list of tuples. An empty list is returned if there is no more record to fetch.
>>> cur.execute("SELECT * FROM test;")
>>> cur.fetchall()
[(1, 100, "abc'def"), (2, None, 'dada'), (3, 42, 'bar')]
A ProgrammingError is raised if the previous call to execute*() did not produce any result set or no call was issued yet.
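To illustrate the arraysize note under fetchmany() above, here is a minimal sketch of that pattern; process() is a hypothetical placeholder for your per-row handling:

cur = conn.cursor()
cur.arraysize = 500  # fetchmany() will now fetch 500 rows per call by default
cur.execute("SELECT * FROM test;")
while True:
    rows = cur.fetchmany()  # no size argument: cur.arraysize is used on every call
    if not rows:
        break
    for row in rows:
        process(row)  # hypothetical per-row handler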
I think it indeed depends on the implementation, but you can get an idea of the differences by looking at the MySQLdb sources. Depending on the options, MySQLdb's fetch* methods keep the current set of rows either in memory or on the server side, so fetchmany vs fetchone gives you some flexibility over what to keep in (Python's) memory and what to leave on the database server.
PEP 249 does not give much detail, so I guess the intent is to let implementations optimize for their database while leaving the exact semantics implementation-defined.
These are implementation-specific.
fetchall
Will get all the results from the table. This works better when the table is small; if the table is large, fetchall can fail.
Will use the most memory.
Can cause issues if the query is executed over a network.
fetchmany
fetchmany will get only the required number of results. You can yield the results and process them as you go. A simple generator built on fetchmany:
def results_iter(cursor, arraysize):
    # Generator wrapper around fetchmany; the function name is illustrative.
    while True:
        results = cursor.fetchmany(arraysize)
        if not results:
            break
        for result in results:
            yield result
Related
I'm having a problem executing this SQL statement with a Python list injection. I'm new to Teradata SQL, and I'm not sure if this is the appropriate syntax for injecting a list into the WHERE clause.
conn = teradatasql.connect(host='PROD', user='1234', password='1234', logmech='LDAP')
l = ["Comp-EN Routing", "Comp-COLLABORATION"]
l2 = ["PEO", "TEP"]
l3 = ["TCV"]
crsr = conn.cursor()
query = """SELECT SOURCE_ORDER_NUMBER
FROM DL_.BV_DETAIL
WHERE (LEVEL_1 IN ? AND LEVEL_2 IN ?) or LEVEL_3 IN ?"""
crsr.executemany(query, [l,l2,l3])
conn.autocommit = True
I keep getting this error
[Version 17.0.0.2] [Session 308831600] [Teradata Database] [Error 3939] There is a mismatch between the number of parameters specified and the number of parameters required.
Late to answer this, but if I found the question, someone else will in the future too.
executemany in teradatasql requires the second parameter to be a "sequence of sequences". The most common type of sequence we generally use in Python is a list. Essentially you need a list that contains, for each element in the list, another list.
In your case this may look like:
myListOfLists=[['level1valueA','level1valueA','level3valueA'],['level1valueB','level1valueB','level3valueB']]
Your SQL statement will be executed twice, once for each list in your list.
In your case, though, I suspect you want to find any combination of the values stored in your three lists, which is an entirely different ball of wax and is going to take some creativity: either generate a list of lists with all possible combinations and submit it to executemany (see the sketch below), or construct a SQL statement that can take in multiple comma-delimited lists of values, form a Cartesian product, and test for hits.
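For the first option, itertools.product can generate the combinations. A rough, untested sketch reusing the names from the question; note that with executemany each ? binds a single value per execution, so IN becomes =:

import itertools

# One parameter row per (LEVEL_1, LEVEL_2, LEVEL_3) combination
combos = [list(c) for c in itertools.product(l, l2, l3)]

query = """SELECT SOURCE_ORDER_NUMBER
FROM DL_.BV_DETAIL
WHERE (LEVEL_1 = ? AND LEVEL_2 = ?) OR LEVEL_3 = ?"""
crsr.executemany(query, combos)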
Want to add something regarding SELECT statements and the executemany method: to retrieve all records returned by your query, you need to call .nextset() followed by .fetchall() repeatedly, until .nextset() returns False. The first .fetchall() will give you only the first result set (for the first list of parameters specified).
...
with teradatasql.connect(connectionstring) as conn:
    with conn.cursor() as cur:
        cur.executemany("SELECT COL1 FROM THEDATABASE.THETABLE WHERE COL1 = ?;", [['A'],['B']])
        result = cur.fetchall()  # will bring you only rows matching 'A'
        if cur.nextset():
            result2 = cur.fetchall()  # results for 'B'
...
According to PEP249, Cursor.execute has no defined return values. pyodbc, however, seems to make it return a cursor object; the docs say so, too, albeit rather briefly:
execute(...)
C.execute(sql, [params]) --> Cursor
Is this guaranteed/documented somewhere in more detail?
Looking at identities, the object returned appears to be the very same cursor, perhaps for chaining calls?
>>> thing_called_cursor = conn.cursor()
>>> result = thing_called_cursor.execute("SELECT * FROM Item")
>>> result
<pyodbc.Cursor object at 0x10b3290f0>
>>> thing_called_cursor
<pyodbc.Cursor object at 0x10b3290f0>
Also,
>>> id(result)
4482830576
>>> id(thing_called_cursor)
4482830576
I could try looking into the sources, but I'd rather not depend on anything I find there. Perhaps it is best to ignore whatever is currently being returned by Cursor.execute as doing so best meets the specification in the PEP?
You can see from the source that at the end it eventually executes return (PyObject*)cur;, which is the same cursor that execute was called on in the first place. However, it does look like there are cases where it returns 0.
It looks like this is covered in the README.md as well:
The DB API specification does not specify the return value of Cursor.execute. Previous versions of pyodbc (2.0.x) returned different values, but the 2.1 versions always return the Cursor itself.
This allows for compact code such as:
for row in cursor.execute("select album_id, photo_id from photos where user_id=1"):
    print row.album_id, row.photo_id
row = cursor.execute("select * from tmp").fetchone()
rows = cursor.execute("select * from tmp").fetchall()
count = cursor.execute("update photos set processed=1 where user_id=1").rowcount
count = cursor.execute("delete from photos where user_id=1").rowcount
So it looks like the rationale is to allow for compact code.
The typical MySQLdb library query can use a lot of memory and perform poorly in Python when a large result set is generated. For example:
cursor.execute("SELECT id, name FROM `table`")
for i in xrange(cursor.rowcount):
    id, name = cursor.fetchone()
    print id, name
There is an optional cursor class that fetches just one row at a time, really speeding up the script and cutting its memory footprint a lot.
import MySQLdb
import MySQLdb.cursors

conn = MySQLdb.connect(user="user", passwd="password", db="dbname",
                       cursorclass=MySQLdb.cursors.SSCursor)
cur = conn.cursor()
cur.execute("SELECT id, name FROM users")
row = cur.fetchone()
while row is not None:
    doSomething()
    row = cur.fetchone()
cur.close()
conn.close()
But I can't find anything about using SSCursor with nested queries. If this is the definition of doSomething():
def doSomething():
    cur2 = conn.cursor()
    cur2.execute('select id,x,y from table2')
    rows = cur2.fetchall()
    for row in rows:
        doSomethingElse(row)
    cur2.close()
then the script throws the following error:
_mysql_exceptions.ProgrammingError: (2014, "Commands out of sync; you can't run this command now")
It sounds as if SSCursor is not compatible with nested queries. Is that true? If so, that's too bad, because the main loop seems to run too slowly with the standard cursor.
This problem is discussed a bit in the MySQLdb User's Guide, under the heading of the threadsafety attribute (emphasis mine):
The MySQL protocol can not handle multiple threads using the same connection at once. Some earlier versions of MySQLdb utilized locking to achieve a threadsafety of 2. While this is not terribly hard to accomplish using the standard Cursor class (which uses mysql_store_result()), it is complicated by SSCursor (which uses mysql_use_result()); with the latter you must ensure all the rows have been read before another query can be executed.
The documentation for the MySQL C API function mysql_use_result() gives more information about your error message:
When using mysql_use_result(), you must execute mysql_fetch_row() until a NULL value is returned, otherwise, the unfetched rows are returned as part of the result set for your next query. The C API gives the error Commands out of sync; you can't run this command now if you forget to do this!
In other words, you must completely fetch the result set from any unbuffered cursor (i.e., one that uses mysql_use_result() instead of mysql_store_result() - with MySQLdb, that means SSCursor and SSDictCursor) before you can execute another statement over the same connection.
In your situation, the most direct solution would be to open a second connection to use while iterating over the result set of the unbuffered query. (It wouldn't work to simply get a buffered cursor from the same connection; you'd still have to advance past the unbuffered result set before using the buffered cursor.)
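A minimal sketch of the two-connection approach, reusing the placeholder connection parameters and doSomethingElse() from the question:

import MySQLdb
import MySQLdb.cursors

# Unbuffered connection streams the big result set row by row
conn_outer = MySQLdb.connect(user="user", passwd="password", db="dbname",
                             cursorclass=MySQLdb.cursors.SSCursor)
# Separate buffered connection handles the nested queries
conn_inner = MySQLdb.connect(user="user", passwd="password", db="dbname")

cur = conn_outer.cursor()
cur.execute("SELECT id, name FROM users")
for id, name in cur:
    cur2 = conn_inner.cursor()
    cur2.execute('select id,x,y from table2')
    for row in cur2.fetchall():
        doSomethingElse(row)
    cur2.close()
cur.close()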
If your workflow is something like "loop through a big result set, executing N little queries for each row," consider looking into MySQL's stored procedures as an alternative to nesting cursors from different connections. You can still use MySQLdb to call the procedure and get the results, though you'll definitely want to read the documentation of MySQLdb's callproc() method since it doesn't conform to Python's database API specs when retrieving procedure outputs.
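The call itself might look something like this; process_row is a hypothetical procedure name, and row_id stands in for a value from your outer loop:

cur = conn.cursor()
cur.callproc('process_row', (row_id,))  # hypothetical procedure taking one argument
results = cur.fetchall()  # any result set the procedure produces
cur.nextset()             # advance past the procedure's call-status result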
A second alternative is to stick to buffered cursors, but split up your query into batches. That's what I ended up doing for a project last year where I needed to loop through a set of millions of rows, parse some of the data with an in-house module, and perform some INSERT and UPDATE queries after processing each row. The general idea looks something like this:
QUERY = r"SELECT id, name FROM `table` WHERE id BETWEEN %s and %s;"
BATCH_SIZE = 5000
i = 0
while True:
    cursor.execute(QUERY, (i + 1, i + BATCH_SIZE))
    result = cursor.fetchall()
    # If there's no possibility of a gap as large as BATCH_SIZE in your table ids,
    # you can test to break out of the loop like this (otherwise, adjust accordingly):
    if not result:
        break
    for row in result:
        doSomething()
    i += BATCH_SIZE
One other thing I would note about your example code is that you can iterate directly over a cursor in MySQLdb instead of calling fetchone() explicitly over xrange(cursor.rowcount). This is especially important when using an unbuffered cursor, because the rowcount attribute is undefined and will give a very unexpected result (see: Python MysqlDB using cursor.rowcount with SSDictCursor returning wrong count).
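For example, a direct-iteration version of the earlier loop, with the same placeholder names:

cur.execute("SELECT id, name FROM users")
for id, name in cur:  # streams row by row; no rowcount needed
    doSomething()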
Both methods return a list of the rows returned by the query. Did I miss something here, or do they indeed have identical usages?
Are there any differences performance-wise?
If you are using the default cursor, a MySQLdb.cursors.Cursor, the entire result set will be stored on the client side (i.e. in a Python list) by the time the cursor.execute() is completed.
Therefore, even if you use
for row in cursor:
you will not be getting any reduction in memory footprint. The entire result set has already been stored in a list (See self._rows in MySQLdb/cursors.py).
However, if you use an SSCursor or SSDictCursor:
import MySQLdb
import MySQLdb.cursors as cursors
conn = MySQLdb.connect(..., cursorclass=cursors.SSCursor)
then the result set is stored in the server, mysqld. Now you can write
cursor = conn.cursor()
cursor.execute('SELECT * FROM HUGETABLE')
for row in cursor:
    print(row)
and the rows will be fetched one-by-one from the server, thus not requiring Python to build a huge list of tuples first, and thus saving on memory.
Otherwise, as others have already stated, cursor.fetchall() and list(cursor) are essentially the same.
cursor.fetchall() and list(cursor) are essentially the same. A different option is to not retrieve a list, and instead just loop over the bare cursor object:
for result in cursor:
This can be more efficient if the result set is large, as it doesn't have to fetch the entire result set and keep it all in memory; it can just get each item incrementally (or in smaller batches).
list(cursor) works because a cursor is an iterable; you can also use cursor in a loop:
for row in cursor:
    # ...
A good database adapter implementation will fetch rows in batches from the server, saving on the memory footprint required as it will not need to hold the full result set in memory. cursor.fetchall() has to return the full list instead.
There is little point in using list(cursor) over cursor.fetchall(); the end effect is then indeed the same, but you wasted an opportunity to stream results instead.
A (MySQLdb/PyMySQL-specific) difference worth noting when using a DictCursor is that list(cursor) will always give you a list, while cursor.fetchall() gives you a list unless the result set is empty, in which case it gives you an empty tuple. This was the case in MySQLdb and remains the case in the newer PyMySQL, where it will not be fixed for backwards-compatibility reasons. While this isn't a violation of the Python Database API Specification, it's still surprising and can easily lead to a type error caused by wrongly assuming that the result is a list, rather than just a sequence.
Given the above, I suggest always favouring list(cursor) over cursor.fetchall(), to avoid ever getting caught out by a mysterious type error in the edge case where your result set is empty.
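A small sketch of that edge case, assuming a MySQLdb/PyMySQL connection whose cursorclass is DictCursor and a query that matches no rows:

cursor.execute("SELECT id, name FROM `table` WHERE 1 = 0")
print(type(cursor.fetchall()))  # empty result set: an empty tuple, not a list

cursor.execute("SELECT id, name FROM `table` WHERE 1 = 0")
print(type(list(cursor)))       # always a list, even when empty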
You could use a list comprehension to bring the first item of each result tuple into a list:
conn = mysql.connector.connect()
cursor = conn.cursor()
sql = "SELECT column_name FROM db.table_name;"
cursor.execute(sql)
results = cursor.fetchall()
# bring the first item of the tuple in your results here
item_0_in_result = [_[0] for _ in results]
Does anyone know how to get the row count from a SQLAlchemy query ResultProxy object without looping through the result set? The ResultProxy.rowcount attribute shows 0; I would expect it to have a value of 2. For updates it shows the number of rows affected, which is what I would expect.
from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker
engine = create_engine(
    'oracle+cx_oracle://user:pass@host:port/database'
)
session = sessionmaker(
    bind = engine
    , autocommit = False
    , autoflush = False
)()
sql_text = u"""
SELECT 1 AS Val FROM dual UNION ALL
SELECT 2 AS Val FROM dual
"""
results = session.execute(sql_text)
print '%s rows returned by query...\n' % results.rowcount
print results.keys()
for i in results:
    print repr(i)
Output:
0 rows returned by query...
[u'val']
(1,)
(2,)
resultproxy.rowcount is ultimately a proxy for the DBAPI attribute cursor.rowcount. Most DBAPIs do not provide the "count of rows" for a SELECT query via this attribute; its primary purpose is to provide the number of rows matched by an UPDATE or DELETE statement. A relational database in fact does not know how many rows would be returned by a particular statement until it has finished locating all of those rows; many DBAPI implementations will begin returning rows as the database finds them, without buffering, so no such count is even available in those cases.
To get the count of rows a SELECT query would return, you either need to do a SELECT COUNT(*) up front, or you need to fetch all the rows into an array and perform len() on the array.
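For example, with the query from the question (a sketch; ResultProxy.scalar() returns the first column of the first row):

# Option 1: ask the database for the count up front
count = session.execute(
    "SELECT COUNT(*) FROM (SELECT 1 AS Val FROM dual UNION ALL "
    "SELECT 2 AS Val FROM dual)"
).scalar()

# Option 2: fetch all rows and count them client-side
rows = session.execute(sql_text).fetchall()
print '%s rows returned by query...\n' % len(rows)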
The notes at ResultProxy.rowcount (http://docs.sqlalchemy.org/en/latest/core/connections.html?highlight=rowcount#sqlalchemy.engine.ResultProxy.rowcount) discuss this further:
Notes regarding ResultProxy.rowcount:
This attribute returns the number of rows matched, which is not necessarily the same as the number of rows that were actually modified: an UPDATE statement, for example, may have no net change on a given row if the SET values given are the same as those present in the row already. Such a row would be matched but not modified. On backends that feature both styles, such as MySQL, rowcount is configured by default to return the match count in all cases.
ResultProxy.rowcount is only useful in conjunction with an UPDATE or DELETE statement. Contrary to what the Python DBAPI says, it does not return the number of rows available from the results of a SELECT statement, as DBAPIs cannot support this functionality when rows are unbuffered.
ResultProxy.rowcount may not be fully implemented by all dialects. In particular, most DBAPIs do not support an aggregate rowcount result from an executemany call. The ResultProxy.supports_sane_rowcount() and ResultProxy.supports_sane_multi_rowcount() methods will report from the dialect if each usage is known to be supported.
Statements that use RETURNING may not return a correct rowcount.
You could use this:
rowcount = len(results._saved_cursor._result.rows)
Then your code will be
print '%s rows returned by query...\n' % rowcount
print results.keys()
Only tested with 'find' queries.
It works for me.