What function and role does a cursor play in PyMySQL?

I searched the web, and Stack Overflow in particular, and I couldn't find any simple explanation of the role a cursor plays in PyMySQL. Why is it required? What function does it fulfill? Can I have multiple cursors? Can I pass one as an argument to a class or a function?
Looking at tutorials with examples, I wrote code that uses cursors and does work. But so far the use of cursors is counterintuitive to me, because I don't really understand their role and function.
Please help...

The cursor in MySQL is used in most cases to retrieve rows from your result set and then perform operations on that data. The cursor lets you iterate over the rows returned by an SQL query, one at a time.
Here is an example.
1) First we declare a cursor:
DECLARE cursor_name CURSOR FOR SELECT_statement;
2) Let's open the cursor.
OPEN cursor_name;
3) Now we can use the FETCH statement to retrieve the next row in the result set.
(Recall the syntax of the FETCH statement: FETCH [ NEXT [ FROM ] ] cursor_name INTO variable_list;. As you can see, the cursor name is part of the syntax, so the cursor is a vital part of the FETCH statement.)
FETCH cursor_name INTO variable_list;
4) Summary: we used cursor_name to FETCH the next row and stored it in variable_list (a comma-separated list of variables that receives the fetched values).
This should illustrate the following:
FETCH uses a MySQL cursor to retrieve the next row of a result set.
The cursor is a tool for iterating over the rows of a result set, one row at a time.
The pymysql cursor
PyMySQL is what you use to interact with the database, but take a look at PEP 249, which defines the Python Database API Specification.
PyMySQL follows PEP 249, so its cursor object is the cursor that PEP 249 describes.
And in PEP 249 we see this:
https://www.python.org/dev/peps/pep-0249/#cursor-objects
"Cursor Objects
These objects represent a database cursor, which is used to manage the context of a fetch operation. Cursors created from the same connection are not isolated, i.e., any changes done to the database by a cursor are immediately visible by the other cursors. Cursors created from different connections can or can not be isolated, depending on how the transaction support is implemented (see also the connection's .rollback() and .commit() methods)."
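To tie this back to PyMySQL, here is a minimal sketch of the usual pattern, assuming placeholder connection details and a hypothetical users table. The connection represents the session with the server; the cursor is the object you execute statements through and fetch rows from.

import pymysql

# Placeholder connection details.
connection = pymysql.connect(
    host="localhost",
    user="me",
    password="secret",
    database="some_db",
)

# A connection can hand out any number of cursors; each one manages
# the context of its own fetch operations (execute, fetchone, fetchall).
with connection.cursor() as cursor:
    cursor.execute("SELECT id, name FROM users WHERE id > %s", (10,))
    first_row = cursor.fetchone()        # one row
    remaining_rows = cursor.fetchall()   # the rest of the result set

# A cursor is an ordinary Python object, so passing it to a function
# or a class is fine, as long as its connection stays open.
def count_users(cur):
    cur.execute("SELECT COUNT(*) FROM users")
    return cur.fetchone()[0]

with connection.cursor() as cursor:
    print(count_users(cursor))

connection.close()

So yes, you can create multiple cursors from one connection and pass them around; the connection just has to stay open while they are in use.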

Related

Preventing write modifications to an Oracle database, using Python.

I am currently using the cx_Oracle module in Python to connect to my Oracle database. I would like to allow the user of the program to perform only read-only operations, like SELECT, and NOT INSERT/DELETE queries.
Is there something I can do to the connection/cursor variables once I establish the connection to prevent write queries?
I am using the Python Language.
Appreciate any help.
Thanks.
One possibility is to issue the statement "set transaction read only" as in the following code:
import cx_Oracle

# connect, then put the current transaction into read-only mode
conn = cx_Oracle.connect("cx_Oracle/welcome")
cursor = conn.cursor()
cursor.execute("set transaction read only")

# any DML attempted in this transaction now fails
cursor.execute("insert into c values (1, 'test')")
That will result in the following error:
ORA-01456: may not perform insert/delete/update operation inside a READ ONLY transaction
Of course you'll have to make sure that you create a Connection class that calls this statement when it is first created and after each and every commit() and rollback() call. And it can still be circumvented by calling a PL/SQL block that performs a commit or rollback.
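Here is a rough sketch of that Connection-class idea, assuming cx_Oracle's documented support for subclassing Connection; the DSN and credentials are placeholders:

import cx_Oracle

class ReadOnlyConnection(cx_Oracle.Connection):
    # Re-issues "set transaction read only" on creation and after
    # every commit()/rollback(), per the suggestion above.

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._start_read_only_transaction()

    def _start_read_only_transaction(self):
        cur = self.cursor()
        cur.execute("set transaction read only")
        cur.close()

    def commit(self):
        super().commit()
        self._start_read_only_transaction()

    def rollback(self):
        super().rollback()
        self._start_read_only_transaction()

# Usage (placeholder credentials):
# conn = ReadOnlyConnection("user/password@localhost/orclpdb")
# conn.cursor().execute("insert into c values (1, 'test')")  # raises ORA-01456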
The only other possibility that I can think of right now is to create a restricted user or role which simply doesn't have the ability to insert, update, delete, etc., and make sure the application uses that user or role. This one at least is foolproof, but a lot more effort up front!

What's psycopg2 doing when I iterate a cursor?

I'm trying to understand what this code is doing behind the scenes:
import psycopg2
c = psycopg2.connect('dbname=some_db user=me').cursor()
c.execute('select * from some_table')
for row in c:
    pass
Per PEP 249, my understanding was that this repeatedly calls Cursor.next(), which is equivalent to calling Cursor.fetchone(). However, the psycopg2 docs say the following:
When a database query is executed, the Psycopg cursor usually fetches all the records returned by the backend, transferring them to the client process.
So I'm confused -- when I run the code above, is it storing the results on the server and fetching them one by one, or is it bringing over everything at once?
It depends on how you configure psycopg2. See itersize and server-side cursors.
By default it fetches all rows into client memory and then just iterates over the fetched rows with the cursor. But per the docs above, you can configure batched fetches from a server-side cursor instead.
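A short sketch of both modes, reusing the connection string and table from the question (the cursor name and itersize value are arbitrary):

import psycopg2

conn = psycopg2.connect('dbname=some_db user=me')

# Default client-side cursor: execute() transfers the whole result
# set to the client before the for-loop even starts.
client_cur = conn.cursor()
client_cur.execute('select * from some_table')
for row in client_cur:
    pass

# Named (server-side) cursor: rows stay on the server and are pulled
# over in batches of itersize as you iterate.
server_cur = conn.cursor(name='my_server_cursor')
server_cur.itersize = 2000
server_cur.execute('select * from some_table')
for row in server_cur:
    pass

server_cur.close()
conn.close()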

Does cx_Oracle support scrollable cursors?

I have a large dataset to return from Oracle, too large to fit into memory.
I need to re-walk the entire dataset many times.
Because of the size of the dataset, rerunning the query all the time is obviously not an option.
Is there a way to access a scrollable Cursor from Oracle? I'm using cx_Oracle.
In PostgreSQL, I can just do cursor.scroll(0, mode='absolute') to send the cursor back to beginning of dataset.
Google suggests that OCI 8 supports scrollable client-side cursors and has C examples for constructing such a cursor. The cx_Oracle documentation doesn't show a Cursor.scroll method, even though it's specified as part of DB-API 2.0.
Am I going to be stuck using pyodbc or something?
Short answer, no.
Longer answer...
Although Cursor.scroll() is specified as part of PEP 249, it's in the Optional DB API Extensions section, to quote:
As with all DB API optional features, the database module authors are free to not implement these additional attributes and methods (using them will then result in an AttributeError) or to raise a NotSupportedError in case the availability can only be checked at run-time.
This simply hasn't been implemented in cx_Oracle, though, as you say, it is possible with OCI.
You mention that the dataset is too large to fit into memory; I assume you mean client-side? Have you considered letting the database shoulder the burden? You don't mention what your query is, how complicated it is, or how much data is actually returned, but you could consider caching the result set. There are a few options here, and both the database and the OS will do some caching in the background anyway; the main one, though, would be to use the RESULT_CACHE hint:
select /*+ result_cache */ ...
from ...
The amount of memory you can use is governed by the RESULT_CACHE_MAX_SIZE initialization parameter, the value of which you can find by running the following query:
select *
from v$parameter
where name = 'result_cache_max_size'
How useful this is depends on the amount of work your database is doing, the size of the parameter, etc. There's a lot of information available on the subject.
Another option might be to use a global temporary table (GTT) to persist the results. Use the cursor to insert data into the GTT and then your select becomes
select * from temp_table
I can see one main benefit: you'll be able to access the table by the index of the row, as you wanted to do with the scrollable cursor. Declare your table with an additional column to hold that index:
create global temporary table temp_table (
i number
, col1 ...
, primary key (i)
) on commit delete rows
Then insert into it using the ROWNUM pseudocolumn to create the same zero-based "index" as you would have in Python:
insert into temp_table
select rownum - 1, c.*
from ( <your original query> ) c
To access the 0th row you can then add the predicate WHERE i = 0, or to "re-start" the cursor, you can simply re-select. Because the data is stored "flat", re-accessing should be a lot quicker.
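From Python with cx_Oracle, the whole approach might look roughly like this; credentials, big_table, and col1 are placeholders, and the GTT is assumed to have been created as above:

import cx_Oracle

conn = cx_Oracle.connect("user/password@localhost/orclpdb")  # placeholder DSN
cur = conn.cursor()

# Fill the GTT once, numbering the rows with ROWNUM - 1.
cur.execute("""
    insert into temp_table (i, col1)
    select rownum - 1, q.col1
    from (select col1 from big_table order by col1) q
""")

# Re-walk the whole dataset as many times as needed without
# re-running the original, expensive query.
for _ in range(3):
    cur.execute("select i, col1 from temp_table order by i")
    for row in cur:
        pass  # process row

# Jump to an absolute position, much like cursor.scroll(n, mode='absolute').
cur.execute("select col1 from temp_table where i = :idx", idx=0)
print(cur.fetchone())

Note that with ON COMMIT DELETE ROWS the temporary rows only survive the current transaction, so avoid committing between the insert and the reads (or create the GTT with ON COMMIT PRESERVE ROWS instead).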

query and insert simultaneously to a mongodb collection

What happens if you query and insert into a MongoDB collection simultaneously?
For example,
process-I:
for page in coll.find():
    pass  # access page

process-II:
for page in gen_pages():
    coll.insert(page)
Will the find() in process-I return the new insertions from process-II?
Suppose that coll is huge and process-II will terminate before process-I.
Sincere Thanks~
Cursors are not isolated in MongoDB. So, assuming that the find method uses a MongoDB cursor internally (which I believe it does), the results are affected by changes to the data from inserts and other writes. Depending on the nature of the query and the data that was inserted, the new values could appear in the results. A number of factors come into play, including where the cursor is currently pointing, sorting, when locks are taken, the number of documents requested by the cursor operation, and so on.
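A small illustration with pymongo; the database and collection names are made up, and it only shows that a document inserted while a cursor is open may or may not surface in that cursor's results:

from pymongo import MongoClient

coll = MongoClient()["test_db"]["pages"]
coll.delete_many({})
coll.insert_many([{"n": i} for i in range(1000)])

cursor = coll.find()          # documents are fetched lazily, in batches
first = next(cursor)          # start iterating

coll.insert_one({"n": 1000})  # an insert while the cursor is still open

seen = 1 + sum(1 for _ in cursor)
print(seen)  # may be 1000 or 1001, depending on batching, index order,
             # where the cursor is pointing, and so on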

Why doesn't PostgreSQL start returning rows immediately?

The following query returns data right away:
SELECT time, value from data order by time limit 100;
Without the limit clause, it takes a long time before the server starts returning rows:
SELECT time, value from data order by time;
I observe this both by using the query tool (psql) and when querying using an API.
Questions/issues:
The amount of work the server has to do before starting to return rows should be the same for both select statements. Correct?
If so, why is there a delay in case 2?
Is there some fundamental RDBMS issue that I do not understand?
Is there a way I can make postgresql start returning result rows to the client without pause, also for case 2?
EDIT (see below): It looks like setFetchSize is the key to solving this. In my case I execute the query from Python, using SQLAlchemy. How can I set that option for a single query (executed by session.execute)? I use the psycopg2 driver.
The column time is the primary key, BTW.
EDIT:
I believe this excerpt from the JDBC driver documentation describes the problem and hints at a solution (I still need help - see the last bullet list item above):
By default the driver collects all the results for the query at once. This can be inconvenient for large data sets so the JDBC driver provides a means of basing a ResultSet on a database cursor and only fetching a small number of rows.
and
Changing code to cursor mode is as simple as setting the fetch size of the Statement to the appropriate size. Setting the fetch size back to 0 will cause all rows to be cached (the default behaviour).
// make sure autocommit is off
conn.setAutoCommit(false);
Statement st = conn.createStatement();
// Turn use of the cursor on.
st.setFetchSize(50);
The psycopg2 DBAPI driver buffers the whole query result before returning any rows. You'll need to use a server-side cursor to fetch results incrementally. For SQLAlchemy, see server_side_cursors in the docs and, if you're using the ORM, the Query.yield_per() method.
SQLAlchemy currently doesn't have an option to set that per single query, but there is a ticket with a patch for implementing that.
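For reference, a minimal sketch of the engine-wide option the answer above points to; the connection URL is a placeholder, and the exact option name can vary between SQLAlchemy versions (newer releases also expose a per-execution stream_results option):

from sqlalchemy import create_engine, text

# Engine-wide server-side cursors for the psycopg2 dialect.
engine = create_engine(
    "postgresql+psycopg2://me@localhost/some_db",
    server_side_cursors=True,
)

with engine.connect() as conn:
    result = conn.execute(text("SELECT time, value FROM data ORDER BY time"))
    for row in result:
        pass  # rows are fetched in batches from a server-side cursor

# With the ORM, session.query(...).yield_per(1000) gives a similar effect.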
In theory, because your ORDER BY is by primary key, a sort of the results should not be necessary, and the DB could indeed return data right away in key order.
I would expect a capable DB to notice this and optimize for it. It seems that PostgreSQL does not. * shrug *
You don't notice any impact if you have LIMIT 100 because it's very quick to pull those 100 results out of the DB, and you won't notice any delay if they're first gathered up and sorted before being shipped out to your client.
I suggest trying to drop the ORDER BY. Chances are, your results will be correctly ordered by time anyway (there may even be a standard or specification that mandates this, given your PK), and you might get your results more quickly.
