query and insert simultaneously to a mongodb collection - python

What happens if I query and insert into a MongoDB collection simultaneously?
For example,
process-I:
    for page in coll.find():
        # access page
        ...

process-II:
    for page in gen_pages():
        coll.insert(page)
Will the find() in process-I return the new insertions from process-II?
Suppose that coll is huge and process-II will terminate before process-I does.
Sincere Thanks~

Cursors are not isolated in MongoDB. So, assuming that the find method uses a MongoDB cursor internally (which I believe it does), the results are affected by changes to the data from inserts, etc. Depending on the nature of the query and the data that was inserted, the new values could appear in the results. There are a number of factors involved, including where the cursor is currently pointing, sorting, when locks are taken, the number of documents requested per batch by the cursor operation, and so on.
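A minimal sketch of this with PyMongo against a local mongod (the database, collection, and batch size below are made up for illustration); it shows that a cursor is not an isolated snapshot, so a document inserted while the cursor is open may or may not come back from the same find():

from pymongo import MongoClient

client = MongoClient()                      # local mongod on the default port
coll = client.test.pages
coll.drop()
coll.insert_many([{"n": i} for i in range(10)])

cursor = coll.find().batch_size(2)          # small batches force extra round trips
seen = []
for i, page in enumerate(cursor):
    seen.append(page["n"])
    if i == 3:
        # Insert while the cursor is still open; whether this document shows
        # up in the loop above is not guaranteed either way.
        coll.insert_one({"n": 999})
print(seen)

If you only want the documents that existed when the scan started, one common workaround is to capture an upper bound (for example the largest _id) before iterating and add it as a filter to the query.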

Related

What function and role does a cursor play in PyMySQL?

I searched the web and the Stack Overflow site in particular, and I couldn't find any simple explanation as to the role a cursor plays in PyMySQL. Why is it required? What function does it fulfill? Can I have multiple cursors? Can I pass one as an argument to a class or a function?
Looking at tutorials with examples, I wrote code that uses cursors and does work. But so far the use of cursors is counter-intuitive to me without really understanding their role and function.
Please help...
The cursor in MySQL is used in most cases to retrieve rows from your result set and then perform operations on that data. The cursor enables you to iterate over the rows returned by an SQL query.
Here is an example.
1) First we declare a cursor:
DECLARE cursor_name CURSOR FOR SELECT_statement;
2) Let's open the cursor.
OPEN cursor_name;
3) Now we can use the FETCH statement to retrieve the next row in the result set.
(Recall the syntax for the FETCH statement: FETCH [ NEXT [ FROM ] ] cursor_name INTO variable_list;. As you can see, cursor is within the syntax, so it is a vital part of the FETCH statement).
FETCH cursor_name INTO variable_list;
4) Summary: Okay, so we have used our cursor_name to FETCH the next row, and we store that in variable_list (a list of variables, comma-separated, where the cursor result should be stored).
This should illustrate the following:
FETCH uses a MySQL cursor to fetch the next row from a result set.
The cursor is a tool for iterating over the rows of a result set, one row at a time.
The pymysql cursor
PyMySQL is used to interact with a MySQL database from Python. Take a look at PEP 249, which defines the Python Database API Specification. PyMySQL implements that specification, so its cursor is derived from the cursor described in PEP 249.
And in PEP 249 we see this:
https://www.python.org/dev/peps/pep-0249/#cursor-objects
"Cursor Objects
These objects represent a database cursor, which is used to manage the context of a fetch operation. Cursors created from the same connection are not isolated, i.e., any changes done to the database by a cursor are immediately visible by the other cursors. Cursors created from different connections can or can not be isolated, depending on how the transaction support is implemented (see also the connection's .rollback() and .commit() methods)."
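To tie this back to PyMySQL itself, here is a minimal sketch of the PEP 249 cursor in practice (the connection parameters and the customers table are placeholders for your own):

import pymysql

conn = pymysql.connect(host="localhost", user="myuser",
                       password="mypassword", database="mydb")
try:
    with conn.cursor() as cur:            # cursors are cheap; you can open several
        cur.execute("SELECT id, name FROM customers WHERE active = %s", (1,))
        for row in cur:                   # iterate over the result set row by row
            print(row)

    with conn.cursor() as cur:
        cur.execute("INSERT INTO customers (name, active) VALUES (%s, %s)",
                    ("Alice", 1))
    conn.commit()                         # PEP 249 connections don't autocommit by default
finally:
    conn.close()

The connection manages the session and the transaction; the cursor is the object you execute statements through and fetch rows from. You can create as many cursors as you need and pass them around like any other Python object.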

Insertion into SQL Server based on condition

I have a table in SQL Server and the table already has data for the month of November. I have to insert data for the previous months, starting from January through October. I have the data in a spreadsheet. I want to do a bulk insert using Python. I have successfully established the connection to the server using Python and am able to access the table. However, I don't know how to insert data above the rows that are already present in the table on the server. The table doesn't have any constraints, primary keys, or indexes.
I am not sure whether insertion based on a condition is possible. If it is, kindly share some clues.
Notes: I don't have access to SSIS. I can't do the insertion using "BULK INSERT" because I can't map my shared drive to the SQL Server machine. That's why I have decided to use a Python script for the operation.
SQL Server Management Studio is just the GUI for interacting with SQL Server.
However, I don't know how to insert data above the rows that are
already present in the table on the server
Tables are ordered or structured based on the clustered index. Since you said there aren't any primary keys or indexes, you don't have one, so inserting the records "below" or "above" the existing rows isn't something that can happen. A table without a clustered index is called a heap, which is what you have.
Thus, just insert the data. The order of the results will be determined by any ORDER BY clause you place on a statement, or by the clustered index on the table if you create one.
I assume you think your data is ordered because, by chance, when you run select * from table your results appear to be in the same order each time. However, this blog will show you that this isn't guaranteed and elaborates on the fact that your results truly aren't ordered without an order by clause.
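As for the mechanics of doing the insert from Python, here is a minimal sketch using pandas and pyodbc (the connection string, spreadsheet path, and table/column names are placeholders for your own):

import pandas as pd
import pyodbc

df = pd.read_excel("jan_to_oct.xlsx")         # spreadsheet with the earlier months

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
)
cur = conn.cursor()
cur.fast_executemany = True                   # speeds up bulk executemany calls

# Just insert; a heap has no inherent order, so there is no "above" or "below".
cur.executemany(
    "INSERT INTO monthly_data (report_month, value) VALUES (?, ?)",
    list(df[["report_month", "value"]].itertuples(index=False, name=None)),
)
conn.commit()
conn.close()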

Does cx_Oracle support scrollable cursors?

I have a large dataset to return from Oracle, too large to fit into memory.
I need to re-walk the entire dataset many times.
Because of the size of the dataset, rerunning the query all the time is obviously not an option.
Is there a way to access a scrollable Cursor from Oracle? I'm using cx_Oracle.
In PostgreSQL, I can just do cursor.scroll(0, mode='absolute') to send the cursor back to beginning of dataset.
Google suggests that OCI 8 supports scrollable client-side cursors, and there are C examples for constructing such a cursor. The cx_Oracle documentation doesn't show a Cursor.scroll method, even though it's specified as part of DB-API 2.0.
Am I going to be stuck using pyodbc or something?
Short answer, no.
Longer answer...
Although Cursor.scroll() is specified as part of PEP 249, it's in the Optional DB API Extensions section, to quote:
As with all DB API optional features, the database module authors are free to not implement these additional attributes and methods (using them will then result in an AttributeError) or to raise a NotSupportedError in case the availability can only be checked at run-time.
This simply hasn't been implemented in cx_Oracle, though, as you say, it is possible with OCI.
You mention that the dataset is too large to fit into memory; I assume you mean client-side? Have you considered letting the database shoulder the burden? You don't mention what your query is, how complicated it is, or how much data is actually returned, but you could consider caching the result set. There are a few options here, and both the database and the OS will do some caching in the background anyway; however, the main one would be to use the RESULT_CACHE hint:
select /*+ result_cache */ ...
from ...
The amount of memory you can use is based on the RESULT_CACHE_MAX_SIZE initialization parameter, the value of which you can find by running the following query
select *
from v$parameter
where name = 'result_cache_max_size'
How useful this is depends on the amount of work your database is doing, the size of the parameter, etc. There's a lot of information available on the subject.
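For what it's worth, the client side of that approach could look roughly like this with cx_Oracle (the credentials, table, and column names are placeholders): each pass simply re-runs the hinted query and streams it with fetchmany(), so the client never holds the full result set, while the result cache keeps the reruns cheap on the database side.

import cx_Oracle

conn = cx_Oracle.connect("myuser", "mypassword", "localhost/orclpdb1")

def walk_dataset():
    cur = conn.cursor()
    cur.arraysize = 5000                      # rows fetched per round trip
    cur.execute("select /*+ result_cache */ id, payload from big_table")
    while True:
        rows = cur.fetchmany()                # uses arraysize as the batch size
        if not rows:
            break
        for row in rows:
            yield row

for row in walk_dataset():                    # first pass
    pass                                      # ... process row ...
for row in walk_dataset():                    # second pass, served from the cache
    pass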
Another option might be to use a global temporary table (GTT) to persist the results. Use the cursor to insert data into the GTT and then your select becomes
select * from temp_table
I can see one main benefit, you'll be able to access the table by the index of the row, as you wish to do with the scrollable cursor. Declare your table with an additional column, to indicate the index:
create global temporary table temp_table (
i number
, col1 ...
, primary key (i)
) on commit delete rows
Then insert into it using the ROWNUM pseudocolumn to create the same 0-based "index" as you would have in Python:
insert into temp_table
select rownum - 1, q.*
from ( /* your original query */ ) q;
To access the 0th row you can then add the predicate WHERE i = 0, and to "restart" the cursor you can simply re-select. Because the data is stored "flat", re-accessing it should be a lot quicker.
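From Python with cx_Oracle, the whole thing could look roughly like this (it assumes the temp_table above has already been created, and uses placeholder credentials and a placeholder source query):

import cx_Oracle

conn = cx_Oracle.connect("myuser", "mypassword", "localhost/orclpdb1")
cur = conn.cursor()

# Materialise the result set once, numbering the rows from 0 with ROWNUM.
cur.execute("""
    insert into temp_table (i, col1)
    select rownum - 1, q.col1
    from (select col1 from my_big_table order by col1) q
""")

# Re-walk the data as many times as needed, one slice at a time, by index.
batch = 10000
start = 0
while True:
    cur.execute(
        "select i, col1 from temp_table where i between :lo and :hi order by i",
        lo=start, hi=start + batch - 1,
    )
    rows = cur.fetchall()
    if not rows:
        break
    # ... process rows ...
    start += batch

conn.rollback()    # on commit delete rows: the GTT empties when the transaction ends
conn.close()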

Need help on python sqlite?

1. I have a list of data and an SQLite DB filled with past data, along with some stats on each item. I have to do the following operations with them.
Check if each item in the list is present in the DB. If not, collect some stats on the new item and add them to the DB.
Check if each item in the DB is in the list. If not, delete it from the DB.
I cannot just create a new DB, because I have other processing to do on the new items and the missing items.
In short, I have to update the DB with the new data in the list. What is the best way to do it?
2. I had to use SQLite with Python threads, so I put a lock around every DB read and write operation. Now it has slowed down DB access. What is the overhead of a thread lock operation? And is there any other way to use the DB from multiple threads?
Can someone help me with this? I am using Python 3.1.
You do not need to check anything: just use INSERT OR IGNORE in the first case (make sure you have a corresponding UNIQUE constraint so the INSERT does not create duplicates), and DELETE FROM tbl WHERE data NOT IN ('first item', 'second item', 'third item') in the second case.
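For example, with Python's built-in sqlite3 module (the table and column names here are made up for illustration; the UNIQUE constraint is what lets INSERT OR IGNORE skip duplicates):

import sqlite3

items = ["first item", "second item", "third item"]    # the new list of data

conn = sqlite3.connect("stats.db")
conn.execute("CREATE TABLE IF NOT EXISTS tbl (data TEXT UNIQUE, hits INTEGER)")

# 1) Add items that are not in the DB yet; existing rows are left untouched.
conn.executemany("INSERT OR IGNORE INTO tbl (data, hits) VALUES (?, 0)",
                 [(item,) for item in items])

# 2) Delete rows whose data is no longer in the list.
placeholders = ", ".join("?" * len(items))
conn.execute("DELETE FROM tbl WHERE data NOT IN (%s)" % placeholders, items)

conn.commit()
conn.close()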
As stated in the official SQLite FAQ, "Threads are evil. Avoid them." As far as I remember, there have always been problems with threads + sqlite. It's not that sqlite doesn't work with threads at all; just don't rely on this feature too heavily. You can also have a single thread work with the database and pass all queries to it first, but the effectiveness of such an approach depends heavily on how your program uses the database.
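If you want to try the single-database-thread idea, a rough sketch of the pattern is below (it reuses the hypothetical stats.db table from the previous example; worker threads put SQL on a queue and only one thread ever touches the sqlite connection, so no per-query lock is needed):

import queue
import sqlite3
import threading

work_q = queue.Queue()

def db_worker():
    conn = sqlite3.connect("stats.db")     # the only thread that touches sqlite
    while True:
        job = work_q.get()
        if job is None:                    # sentinel: shut the worker down
            break
        sql, params = job
        conn.execute(sql, params)
        conn.commit()
    conn.close()

worker = threading.Thread(target=db_worker)
worker.start()

# Any thread can queue work without holding its own lock around sqlite calls.
work_q.put(("INSERT OR IGNORE INTO tbl (data, hits) VALUES (?, 0)", ("new item",)))

work_q.put(None)
worker.join()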

Why doesn't PostgreSQL start returning rows immediately?

The following query returns data right away:
SELECT time, value from data order by time limit 100;
Without the limit clause, it takes a long time before the server starts returning rows:
SELECT time, value from data order by time;
I observe this both by using the query tool (psql) and when querying using an API.
Questions/issues:
The amount of work the server has to do before starting to return rows should be the same for both select statements. Correct?
If so, why is there a delay in case 2?
Is there some fundamental RDBMS issue that I do not understand?
Is there a way I can make postgresql start returning result rows to the client without pause, also for case 2?
EDIT (see below). It looks like setFetchSize is the key to solving this. In my case I execute the query from Python, using SQLAlchemy. How can I set that option for a single query (executed by session.execute)? I use the psycopg2 driver.
The column time is the primary key, BTW.
EDIT:
I believe this excerpt from the JDBC driver documentation describes the problem and hints at a solution (I still need help - see the last bullet list item above):
By default the driver collects all the results for the query at once. This can be inconvenient for large data sets so the JDBC driver provides a means of basing a ResultSet on a database cursor and only fetching a small number of rows.
and
Changing code to cursor mode is as simple as setting the fetch size of the Statement to the appropriate size. Setting the fetch size back to 0 will cause all rows to be cached (the default behaviour).
// make sure autocommit is off
conn.setAutoCommit(false);
Statement st = conn.createStatement();
// Turn use of the cursor on.
st.setFetchSize(50);
The psycopg2 DBAPI driver buffers the whole query result before returning any rows. You'll need to use a server-side cursor to fetch results incrementally. For SQLAlchemy, see server_side_cursors in the docs; if you're using the ORM, see the Query.yield_per() method.
SQLAlchemy currently doesn't have an option to set that per single query, but there is a ticket with a patch for implementing that.
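In the meantime, a plain psycopg2 server-side (named) cursor looks roughly like this (the connection parameters are placeholders); naming the cursor makes psycopg2 declare it on the server, so rows are fetched in chunks instead of being buffered up front:

import psycopg2

conn = psycopg2.connect(host="localhost", dbname="mydb",
                        user="myuser", password="mypassword")
cur = conn.cursor(name="stream_data")     # a named cursor is server-side
cur.itersize = 2000                       # rows fetched per network round trip

cur.execute("SELECT time, value FROM data ORDER BY time")
for time_, value in cur:                  # rows start arriving almost immediately
    pass                                  # ... process each row ...

cur.close()
conn.close()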
In theory, because your ORDER BY is by primary key, a sort of the results should not be necessary, and the DB could indeed return data right away in key order.
I would expect a capable DB to notice this and optimize for it. It seems that PGSQL does not. *shrug*
You don't notice any impact if you have LIMIT 100 because it's very quick to pull those 100 results out of the DB, and you won't notice any delay if they're first gathered up and sorted before being shipped out to your client.
I suggest trying to drop the ORDER BY. Chances are, your results will be correctly ordered by time anyway (there may even be a standard or specification that mandates this, given your PK), and you might get your results more quickly.
