I have a large dataset to return from Oracle, too large to fit into memory.
I need to re-walk the entire dataset many times.
Because of the size of the dataset, rerunning the query all the time is obviously not an option.
Is there a way to access a scrollable Cursor from Oracle? I'm using cx_Oracle.
In PostgreSQL, I can just do cursor.scroll(0, mode='absolute') to send the cursor back to the beginning of the dataset.
Google suggests that OCI 8 supports scrollable client-side cursors, and there are C examples for constructing such a cursor. The cx_Oracle documentation doesn't show a Cursor.scroll method, even though it's specified as part of DB-API 2.0.
Am I going to be stuck using pyodbc or something?
Short answer, no.
Longer answer...
Although Cursor.scroll() is specified as part of PEP 249, it's in the Optional DB API Extensions section, to quote:
As with all DB API optional features, the database module authors are free to not implement these additional attributes and methods (using them will then result in an AttributeError) or to raise a NotSupportedError in case the availability can only be checked at run-time.
This simply hasn't been implemented in cx_Oracle, though, as you say, it is possible with OCI.
You mention that the dataset is too large to fit into memory; I assume you mean client-side? Have you considered letting the database shoulder the burden? You don't say what your query is, how complicated it is, or how much data it returns, but you could consider caching the result set. There are a few options here (and both the database and the OS will do some caching in the background anyway), but the main one would be the RESULT_CACHE hint:
select /*+ result_cache */ ...
from ...
The amount of memory you can use is based on the RESULT_CACHE_MAX_SIZE initialization parameter, whose value you can find by running the following query:
select *
from v$parameter
where name = 'result_cache_max_size'
How useful this is depends on the amount of work your database is doing, the size of the parameter, etc. There's a lot of information available on the subject.
Another option might be to use a global temporary table (GTT) to persist the results. Use the cursor to insert data into the GTT and then your select becomes
select * from temp_table
I can see one main benefit: you'll be able to access the table by the index of the row, as you wish to do with the scrollable cursor. Declare your table with an additional column to indicate the index:
create global temporary table temp_table (
i number
, col1 ...
, primary key (i)
) on commit delete rows
Then insert into it using the ROWNUM pseudocolumn to create the same 0-based "index" as you would have in Python:
insert into temp_table
select rownum - 1, c.*
from ( your_query ) c
(where your_query stands for the original SELECT).
To access the 0th row you can then add the predicate WHERE i = 0, or to "re-start" the cursor, you can simply re-select. Because the data is stored "flat", re-accessing should be a lot quicker.
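The same idea can be sketched with any DB-API driver; here sqlite3 stands in for Oracle, and a plain table for the GTT (the table, column names, and data are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table temp_table (i integer primary key, col1 text)")

# Stand-in for "insert ... select rownum - 1, c.* from ( your_query ) c":
# number the rows as we insert them, so each gets a 0-based index.
rows = ["a", "b", "c", "d"]
conn.executemany("insert into temp_table values (?, ?)", list(enumerate(rows)))

# "Scroll" to any absolute position by index ...
row0 = conn.execute("select col1 from temp_table where i = 0").fetchone()

# ... or re-walk the whole materialised set as often as needed.
all_rows = [r for (_, r) in conn.execute("select i, col1 from temp_table order by i")]
```

Because the rows are persisted with their index, re-selecting restarts the walk without re-running the original expensive query.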
Related
I searched the web and the Stack Overflow site in particular, and I couldn't find any simple explanation as to the role a cursor plays in PyMySQL. Why is it required? what function does it fulfill? Can I have multiple cursors? Can I pass it as an argument to a class or a function?
Looking at tutorials with examples I wrote code that uses cursors and does work. But so far the use of cursors is counter intuitive to me without really understanding their role and function.
Please help...
The cursor in MySQL is used in most cases to retrieve rows from your resultset and then perform operations on that data. The cursor enables you to iterate over returned rows from an SQL query.
Here is an example.
1) First we declare a cursor:
DECLARE cursor_name CURSOR FOR SELECT_statement;
2) Let's open the cursor.
OPEN cursor_name;
3) Now we can use the FETCH statement to retrieve the next row in the result set.
(Recall the syntax for the FETCH statement: FETCH [ NEXT [ FROM ] ] cursor_name INTO variable_list;. As you can see, cursor is within the syntax, so it is a vital part of the FETCH statement).
FETCH cursor_name INTO variable_list;
4) Summary: Okay, so we have used our cursor_name to FETCH the next row, and we store that in variable_list (a list of variables, comma-separated, where the cursor result should be stored).
This should illustrate the following:
FETCH uses a MySQL cursor to fetch the next row in a result set.
The cursor is a tool to iterate over your rows in a resultset, one row at a time.
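The same fetch-a-row-at-a-time loop looks like this through a Python DB-API cursor; sqlite3 is used here purely so the demo is self-contained, since PyMySQL's cursor exposes the same PEP 249 interface (the table and data are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()          # the cursor: your handle onto a result set
cur.execute("create table t (x integer)")
cur.executemany("insert into t values (?)", [(1,), (2,), (3,)])

cur.execute("select x from t order by x")
first = cur.fetchone()       # like FETCH: one row, and the cursor advances
rest = cur.fetchall()        # the remaining rows
```

You can open several cursors on one connection, and a cursor object can be passed to functions or classes like any other Python object.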
The pymysql cursor
PyMySQL is used to "interact" with the database. However, take a look at PEP 249 which defines the Python Database API Specification.
PyMySQL is based on the PEP 249 specification, so its cursor is derived from that specification.
And in PEP 249 we see this:
https://www.python.org/dev/peps/pep-0249/#cursor-objects
"Cursor Objects
These objects represent a database cursor, which is used to manage the context of a fetch operation. Cursors created from the same connection are not isolated, i.e., any changes done to the database by a cursor are immediately visible by the other cursors. Cursors created from different connections can or can not be isolated, depending on how the transaction support is implemented (see also the connection's .rollback() and .commit() methods)."
I have a uni assignment where I'm implementing a database that users interact with over a webpage. The goal is to search for books given some criteria. This is one module within a bigger project.
I'd like to let users be able to select the criteria and order they want, but the following doesn't seem to work:
cursor.execute("SELECT * FROM Books WHERE ? REGEXP ? ORDER BY ? ?", [category, criteria, order, asc_desc])
I can't work out why, because when I go
cursor.execute("SELECT * FROM Books WHERE title REGEXP ? ORDER BY price ASC", [criteria])
I get full results. Is there any way to fix this without resorting to injection?
The data is organised in a table where the book's ISBN is a primary key, and each row has many columns, such as the book's title, author, publisher, etc. The user should be allowed to select any of these columns and perform a search.
Generally, SQL engines only support parameters on values, not on the names of tables, columns, etc. And this is true of sqlite itself, and Python's sqlite module.
The rationale behind this is partly historical (traditional clumsy database APIs had explicit bind calls where you had to say which column number you were binding with which value of which type, etc.), but mainly that there isn't much good reason to parameterize names. On the one hand, you don't need to worry about quoting or type conversion for table and column names the way you do for values. On the other hand, once you start letting end-user-sourced text specify a table or column, it's hard to limit the harm it could do.
Also, from a performance point of view (and if you read the sqlite docs—see section 3.0—you'll notice they focus on parameter binding as a performance issue, not a safety issue), the database engine can reuse a prepared optimized query plan when given different values, but not when given different columns.
So, what can you do about this?
Well, generating SQL strings dynamically is one option, but not the only one.
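If you do go the dynamic-SQL route, whitelist the user-supplied names before interpolating them, and keep the search value as a real parameter. A sketch with sqlite3 (LIKE stands in for REGEXP, which sqlite lacks by default; the schema and names are made up):

```python
import sqlite3

# Only names from this whitelist may ever reach the SQL string.
ALLOWED_COLUMNS = {"title", "author", "publisher", "price"}
ALLOWED_DIRECTIONS = {"ASC", "DESC"}

def search(conn, category, pattern, order_col, direction):
    # The names are validated and then interpolated; only the
    # search value is passed as a bound parameter.
    if category not in ALLOWED_COLUMNS or order_col not in ALLOWED_COLUMNS:
        raise ValueError("unknown column")
    if direction.upper() not in ALLOWED_DIRECTIONS:
        raise ValueError("bad sort direction")
    sql = f"SELECT * FROM Books WHERE {category} LIKE ? ORDER BY {order_col} {direction}"
    return conn.execute(sql, (pattern,)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("create table Books (isbn text primary key, title text,"
             " author text, publisher text, price real)")
conn.executemany("insert into Books values (?,?,?,?,?)",
                 [("1", "SQL Basics", "Ann", "X", 10.0),
                  ("2", "Advanced SQL", "Bob", "Y", 20.0)])
hits = search(conn, "title", "%SQL%", "price", "desc")
```

Because arbitrary user text never reaches the SQL string, this avoids injection even though the query is built dynamically.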
First, this kind of thing is often a sign of a broken data model that needs to be normalized one step further. Maybe you should have a BookMetadata table, where you have many rows—each with a field name and a value—for each Book?
Second, if you want something that's conceptually normalized as far as this code is concerned, but actually denormalized (either for efficiency, or because some other code needs it denormalized)… functions are great for that. Register a wrapper with create_function, and you can pass parameters to that function when you execute the query.
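A minimal sketch of that function idea with sqlite3: the hypothetical pick() helper receives the requested field name as an ordinary bound parameter and returns that column's value for each row (LIKE stands in for REGEXP, which sqlite lacks by default; schema and data are made up):

```python
import sqlite3

def pick(field, title, author):
    # Return the value of the *named* column for this row, so the
    # column choice itself becomes an ordinary bound parameter.
    return {"title": title, "author": author}[field]

conn = sqlite3.connect(":memory:")
conn.create_function("pick", 3, pick)
conn.execute("create table Books (isbn text primary key, title text, author text)")
conn.executemany("insert into Books values (?,?,?)",
                 [("1", "Dune", "Herbert"), ("2", "Emma", "Austen")])

rows = conn.execute(
    "select isbn from Books where pick(?, title, author) like ?",
    ("author", "%Aust%")).fetchall()
```

Note the trade-off mentioned above: the engine can no longer use a column index for the filter, since every row goes through the Python function.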
I have a database with a large table containing more than a hundred million rows. I want to export this data (after some transformation, like joining this table with a few others, cleaning some fields, etc.) and store it in a big text file, for later processing with Hadoop.
So far, I tried two things:
Using Python, I browse the table by chunks (typically 10'000 records at a time) using this subquery trick, perform the transformation on each row and write directly to a text file. The trick helps, but the LIMIT becomes slower and slower as the export progresses. I have not been able to export the full table with this.
Using the mysql command-line tool, I tried to output the result of my query in CSV form to a text file directly. Because of the size, it ran out of memory and crashed.
I am currently investigating Sqoop as a tool to import the data directly to HDFS, but I was wondering how other people handle such large-scale exports?
Memory issues point towards using the wrong database query mechanism.
Normally, it is advisable to use mysql_store_result() at the C level, which corresponds to having a Cursor or DictCursor at the Python level. This ensures that the database is free again as soon as possible, and the client can do whatever it wants with the data.
But it is not suitable for large amounts of data, as the data is cached in the client process. This can be very memory consuming.
In this case, it may be better to use mysql_use_result() (C), or SSCursor / SSDictCursor (Python), respectively. This requires you to consume the whole result set and do nothing else with the database connection in the meantime, but it saves your client process a lot of memory. With the mysql CLI, you would achieve this with the -q argument.
I don't know exactly what query you used, because you haven't given it here, but I suppose you're specifying LIMIT and OFFSET. Such queries are quite quick at the beginning of the data, but become very slow as the offset grows.
If you have a unique column such as ID, you can still fetch only the first N rows, but modify the WHERE clause:
WHERE ID > (last_id)
This would use the index and would be acceptably fast.
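A sketch of that keyset approach, with sqlite3 standing in for MySQL (the table, chunk size, and data are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (id integer primary key, doc text)")
conn.executemany("insert into t values (?, ?)",
                 [(i, f"doc{i}") for i in range(1, 11)])

def walk(conn, chunk=3):
    """Yield every row, fetching `chunk` rows per query."""
    last_id = 0
    while True:
        # WHERE id > last_id walks the primary-key index directly;
        # unlike OFFSET, it never re-scans rows already seen.
        rows = conn.execute(
            "select id, doc from t where id > ? order by id limit ?",
            (last_id, chunk)).fetchall()
        if not rows:
            break
        yield from rows
        last_id = rows[-1][0]

docs = [doc for (_, doc) in walk(conn)]
```

Each chunk query costs about the same regardless of how far into the table you are, which is exactly what LIMIT/OFFSET fails to deliver.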
However, it should generally be faster to simply run
SELECT * FROM table
and open a cursor for that query, with a reasonably big fetch size.
I am trying to perform some n-gram counting in python and I thought I could use MySQL (MySQLdb module) for organizing my text data.
I have a pretty big table, around 10 million records, representing documents that are indexed by a unique numeric id (auto-increment) and by a language varchar field (e.g. "en", "de", "es", etc.).
select * from table is too slow and memory devastating.
I ended up splitting the whole id range into smaller ranges (say 2000 records wide each) and processing each of those smaller record sets one by one with queries like:
select * from table where id >= 1 and id <= 1999
select * from table where id >= 2000 and id <= 2999
and so on...
Is there any way to do it more efficiently with MySQL and achieve similar performance to reading a big corpus text file serially?
I don't care about the ordering of the records, I just want to be able to process all the documents that pertain to a certain language in my big table.
You can use the HANDLER statement to traverse a table (or index) in chunks. This is not very portable and works in an "interesting" way with transactions if rows appear and disappear while you're looking at it (hint: you're not going to get consistency) but makes code simpler for some applications.
In general, you are going to get a performance hit, as if your database server is local to the machine, several copies of the data will be necessary (in memory) as well as some other processing. This is unavoidable, and if it really bothers you, you shouldn't use mysql for this purpose.
Aside from having indexes defined on whatever columns you're using to filter the query (language and ID probably, where ID already has an index care of the primary key), no.
First: avoid SELECT * if you can specify the columns you need (lang and doc in this case). Second: unless you change your data very often, I don't see the point of storing all this in a database, especially if you are storing file names. You could use an XML format instead, for example (and read/write it with a SAX API).
If you want a DB and something faster than MySQL, you can consider an in-memory database such as SQLite or BerkeleyDB, both of which have Python bindings.
The following query returns data right away:
SELECT time, value from data order by time limit 100;
Without the limit clause, it takes a long time before the server starts returning rows:
SELECT time, value from data order by time;
I observe this both by using the query tool (psql) and when querying using an API.
Questions/issues:
The amount of work the server has to do before starting to return rows should be the same for both select statements. Correct?
If so, why is there a delay in case 2?
Is there some fundamental RDBMS issue that I do not understand?
Is there a way I can make postgresql start returning result rows to the client without pause, also for case 2?
EDIT (see below). It looks like setFetchSize is the key to solving this. In my case I execute the query from python, using SQLAlchemy. How can I set that option for a single query (executed by session.execute)? I use the psycopg2 driver.
The column time is the primary key, BTW.
EDIT:
I believe this excerpt from the JDBC driver documentation describes the problem and hints at a solution (I still need help - see the last bullet list item above):
By default the driver collects all the results for the query at once. This can be inconvenient for large data sets so the JDBC driver provides a means of basing a ResultSet on a database cursor and only fetching a small number of rows.
and
Changing code to cursor mode is as simple as setting the fetch size of the Statement to the appropriate size. Setting the fetch size back to 0 will cause all rows to be cached (the default behaviour).
// make sure autocommit is off
conn.setAutoCommit(false);
Statement st = conn.createStatement();
// Turn use of the cursor on.
st.setFetchSize(50);
The psycopg2 DB-API driver buffers the whole query result before returning any rows. You'll need to use a server-side cursor to fetch results incrementally. For SQLAlchemy, see server_side_cursors in the docs and, if you're using the ORM, the Query.yield_per() method.
SQLAlchemy currently doesn't have an option to set that per single query, but there is a ticket with a patch for implementing that.
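Whichever driver option you end up with, the client-side pattern is an incremental fetchmany() loop rather than one big fetchall(); sketched here with sqlite3 so it runs anywhere (a psycopg2 named server-side cursor exposes the same fetchmany() interface; the table and data are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table data (time integer primary key, value real)")
conn.executemany("insert into data values (?, ?)",
                 [(t, t * 0.5) for t in range(10)])

cur = conn.cursor()
cur.execute("select time, value from data order by time")

batches = []
while True:
    # Pull a bounded batch at a time, analogous to setFetchSize(50)
    # on the JDBC side; memory use stays proportional to the batch.
    batch = cur.fetchmany(4)
    if not batch:
        break
    batches.append(batch)
```

With a true server-side cursor, each fetchmany() round-trips to the server for the next batch, so the client never holds the full result set.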
In theory, because your ORDER BY is by primary key, a sort of the results should not be necessary, and the DB could indeed return data right away in key order.
I would expect a capable DB to notice this and optimize for it. It seems that PostgreSQL does not. *shrug*
You don't notice any impact with LIMIT 100 because it's very quick to pull those 100 results out of the DB, so there's no perceptible delay even if they're first gathered up and sorted before being shipped out to your client.
I suggest trying to drop the ORDER BY. Chances are, your results will be correctly ordered by time anyway (there may even be a standard or specification that mandates this, given your PK), and you might get your results more quickly.