Can I use ? in SQL to select a table - python

I have an SQL database with three or four tables in it. I would like to write one Python function that uses a parameter to choose which table to search. Is this possible?
The line I was thinking would be
self.cur.execute("SELECT * FROM ?", (table,))
This works when binding a value for a column, but I cannot get it to work for a table name. Is it possible, or should I change my approach?

No. Tables are similar to types in a strongly-typed language, not parameters.
Queries aren't executed like scripts. They are compiled into execution plans, using different operators depending on the table schema, indexes and statistics, i.e. the number of rows and the distribution of values. For the same JOIN, the query optimizer may use a HASH JOIN for unordered, unindexed data, nested loops if the join columns are indexed, or a MERGE join if the data from both tables is already ordered.
Even for the same query, a very different execution plan may be generated depending on whether the table contains a few dozen rows or a few million.
Parameters are passed to that execution plan the same way parameters are passed to a method. They are even passed separately from the SQL text in the RPC call from client to server. That's why they aren't vulnerable to SQL injection - they are never part of the query itself.
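Since a table name cannot be a bound parameter, the usual workaround is to validate the requested name against a whitelist of known tables and interpolate it into the SQL yourself, binding only the values. A minimal sketch, assuming sqlite3 and hypothetical table names:

import sqlite3

ALLOWED_TABLES = {"books", "authors", "publishers"}  # hypothetical table names

def select_all(cur, table):
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table!r}")
    # Safe to format in: the name comes from our own whitelist, never from raw user input.
    cur.execute(f"SELECT * FROM {table}")
    return cur.fetchall()

conn = sqlite3.connect("library.db")  # hypothetical database file
rows = select_all(conn.cursor(), "books")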

Related

Python or R -- create a SQL join using a dataframe

I am trying to find a way, either in R or Python, to use a dataframe as a table in an Oracle SQL statement.
It is impractical, for my objective, to:
Create a string out of a column and use that as the criteria (there are more than 1,000 values, which is the limit Oracle allows in an IN list)
Create a new table in the database and use that (don't have access)
Download the entire contents of the table and merge in pandas (there are millions of records in the database, and this would bog down both the db and my system)
I have found packages that will allow you to "register" a dataframe so that it acts as a table/view and can be queried, but they will not let it be used in a query over a different connection string. Can anyone point me in the right direction? Either a way to use two different connections in the same SQL statement (Oracle plus a package like DuckDB) to permit an inner join, or a direct link to the dataframe that allows it to be used as a table in a join?
SAS does this so effortlessly and I don't want to go back to SAS because the other functionality is not as good as Python / R, but this is a dealbreaker if I can't do database extractions.
Answering my own question here -- after much research.
In short, this cannot be done. Beyond passing criteria as a list or a concatenated string, you cannot create a dataframe in Python or R and pass it through a query into a SQL Server or Oracle database as if it were a table. It's unfortunate, but if you don't have permission to write to temporary tables in the Oracle database, you're out of options.
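For reference, the "register a dataframe" approach mentioned in the question looks roughly like the sketch below with DuckDB. It illustrates the mechanism, and also its limit: the registered view is only visible to DuckDB's own connection, not to a separate Oracle or SQL Server connection. Names are hypothetical.

import duckdb
import pandas as pd

keys = pd.DataFrame({"customer_id": [101, 102, 103]})  # hypothetical join keys

con = duckdb.connect()          # in-memory DuckDB database
con.register("keys", keys)      # the dataframe is now queryable as a view named "keys"

# Joins work, but only against tables DuckDB itself can reach,
# not against a table that lives behind a different Oracle connection.
result = con.execute("SELECT * FROM keys").df()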

How to insert a Pandas Dataframe into a SQL Server synonym table

Summary
In SQL Server, synonyms are often used to abstract a remote table into the current database context. Normal DML operations work just fine on such a construct, but SQL Server does track synonyms as their own object type separately from tables.
I'm attempting to leverage the pandas DataFrame#to_sql method to load a synonym. It works well when the table is local to the database, but it is unable to locate the table via the synonym and instead attempts to create a new table matching the DataFrame's structure, which results in an object name collision and undesirable behavior.
Tracing through the source, it looks like pandas leverages the dialect's has_table method, which in this case resolves to SQL Alchemy's MSSQL dialect implementation, which queries the INFORMATION_SCHEMA.columns view to verify whether the table exists.
Unfortunately, synonym tables don't appear in INFORMATION_SCHEMA views like this. In the answer for "How to find all column names of a synonym", the answerer provides a technique for establishing a synonym's columns, which may be applicable here.
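To illustrate the gap: a synonym never shows up in the INFORMATION_SCHEMA views the dialect consults, but it is listed in the sys.synonyms catalog view. A rough sketch with pyodbc; the connection string and synonym name are hypothetical:

import pyodbc

conn = pyodbc.connect("DSN=mydb")  # hypothetical DSN
cur = conn.cursor()

# Returns no rows for a synonym, so SQLAlchemy's has_table() concludes it doesn't exist.
cur.execute("SELECT COLUMN_NAME FROM INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = ?", "MySynonym")

# The synonym and the object it points at are listed here instead.
cur.execute("SELECT name, base_object_name FROM sys.synonyms")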
The Question
Is there any method available which can optionally skip the table existence check during DataFrame#to_sql? If not, is there any way to force pandas or SQL Alchemy to recognize a synonym? I couldn't find any similar questions on SO, and neither project's issue tracker had anything resembling this either.
I've accepted my own answer, but if anyone has a better technique for loading DataFrames to SQL Server synonyms, please post it!
SQL Alchemy on SQL Server doesn't currently support synonym tables, which means that the DataFrame#to_sql method cannot insert to them and another technique must be employed.
As of SQL Alchemy 1.2, the Oracle dialect supports Synonym/DBLINK Reflection, but no similar feature is available for SQL Server, even on the upcoming SQL Alchemy 1.4 release.
For those trying to solve this in different ways, if your situation meets the following criteria:
Your target synonym is already declared in the ORM as a table
The table's column names match the column names in the DataFrame
The table's column data types either match the DataFrame or can be casted without error
You can perform the following bulk_insert_mappings operation, with TargetTable defining your target in the ORM model and df defining your DataFrame:
db.session.bulk_insert_mappings(
    TargetTable, df.to_dict('records')
)
As a bonus, this is substantially faster than the DataFrame#to_sql operation as well!
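For context, this is roughly what "declared in the ORM as a table" might look like with Flask-SQLAlchemy, assuming an already-initialized db object; the synonym name, columns, and data below are hypothetical, and SQL Server resolves the synonym when the generated INSERT runs:

import pandas as pd

df = pd.DataFrame([{"id": 1, "name": "example", "amount": 10}])  # hypothetical data

class TargetTable(db.Model):
    # The synonym is declared as if it were an ordinary table.
    __tablename__ = "MySynonym"
    id = db.Column(db.Integer, primary_key=True)
    name = db.Column(db.String(100))
    amount = db.Column(db.Integer)

# Column names in df must match the mapped columns and be castable to their types.
db.session.bulk_insert_mappings(TargetTable, df.to_dict("records"))
db.session.commit()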

Python sqlite3 user-defined queries (selecting tables)

I have a uni assignment where I'm implementing a database that users interact with over a webpage. The goal is to search for books given some criteria. This is one module within a bigger project.
I'd like to let users be able to select the criteria and order they want, but the following doesn't seem to work:
cursor.execute("SELECT * FROM Books WHERE ? REGEXP ? ORDER BY ? ?", [category, criteria, order, asc_desc])
I can't work out why, because when I go
cursor.execute("SELECT * FROM Books WHERE title REGEXP ? ORDER BY price ASC", [criteria])
I get full results. Is there any way to fix this without resorting to string formatting (and risking SQL injection)?
The data is organised in a table where the book's ISBN is a primary key, and each row has many columns, such as the book's title, author, publisher, etc. The user should be allowed to select any of these columns and perform a search.
Generally, SQL engines only support parameters on values, not on the names of tables, columns, etc. And this is true of sqlite itself, and Python's sqlite module.
The rationale behind this is partly historical (traditional clumsy database APIs had explicit bind calls where you had to say which column number you were binding with which value of which type, etc.), but mainly because there isn't much good reason to parameterize table or column names.
On the one hand, table and column names don't need the quoting or type conversion that values do. On the other hand, once you start letting end-user-sourced text pick a table or column, it's hard to limit what harm they could do.
Also, from a performance point of view (and if you read the sqlite docs—see section 3.0—you'll notice they focus on parameter binding as a performance issue, not a safety issue), the database engine can reuse a prepared optimized query plan when given different values, but not when given different columns.
So, what can you do about this?
Well, generating SQL strings dynamically is one option, but not the only one.
First, this kind of thing is often a sign of a broken data model that needs to be normalized one step further. Maybe you should have a BookMetadata table, where you have many rows—each with a field name and a value—for each Book?
Second, if you want something that's conceptually normalized as far as this code is concerned, but actually denormalized (either for efficiency, or because some other code needs the flat layout), user-defined functions are great for that: register a wrapper with sqlite3's create_function, and you can pass parameters to that function when you execute the query. A sketch of the dynamic-SQL route, together with the create_function hook you need for REGEXP anyway, follows below.
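Here is a minimal sketch of the dynamic-SQL route, assuming the Books table and columns described in the question. The identifiers are checked against a whitelist before being formatted into the SQL, only the search pattern is bound as a parameter, and REGEXP is supplied via create_function (Python's sqlite3 does not ship one):

import re
import sqlite3

conn = sqlite3.connect("books.db")  # hypothetical database file

# sqlite3 has no built-in REGEXP; register one so "x REGEXP ?" works at all.
conn.create_function("REGEXP", 2,
                     lambda pattern, value: re.search(pattern, str(value)) is not None)

ALLOWED_COLUMNS = {"title", "author", "publisher", "price"}  # columns from the question
ALLOWED_DIRECTIONS = {"ASC", "DESC"}

def search_books(cursor, category, criteria, order, asc_desc):
    if (category not in ALLOWED_COLUMNS or order not in ALLOWED_COLUMNS
            or asc_desc not in ALLOWED_DIRECTIONS):
        raise ValueError("invalid search options")
    # Identifiers come from the whitelist above; only the pattern is a bound parameter.
    sql = f"SELECT * FROM Books WHERE {category} REGEXP ? ORDER BY {order} {asc_desc}"
    cursor.execute(sql, [criteria])
    return cursor.fetchall()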

Diffing and Synchronizing 2 tables MySQL

I have 2 tables, One with new data, and another with old data.
I need to find the diff between the two tables and push only the changes into the table with the old data as it will be in production.
Both the tables are identical in terms of columns, only the data varies.
EDIT:
I am looking for only one way sync
EDIT 2
The table may have foreign keys.
Here are the constraints
I can't use shell utilities like mk-table-sync
I can't use GUI tools like the ones suggested here, because they cannot be automated.
This needs to be done programmatically, or in the db.
I am working in python on Google App-engine.
Currently I am doing things like comparing each record with OUTER JOINs and WHERE [NOT] EXISTS in SQL queries and pushing the results.
My questions are
Is there a better way to do this ?
Is it better to do this in python rather than in the db ?
According to your comment to my question, you could simply do:
DELETE FROM OldTable;
INSERT INTO OldTable (field1, field2, ...) SELECT field1, field2, ... FROM NewTable;
As I pointed out above, there might be reasons not to do this, e.g., data size.
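If wiping OldTable is not an option (foreign keys, table size), a one-way sync can be done with two statements: an upsert for new and changed rows, and a delete for rows that disappeared from NewTable. A rough sketch from Python over a DB-API connection, assuming both tables share a primary key id and hypothetical columns field1 and field2:

import MySQLdb  # or whichever MySQL driver is available in your environment

connection = MySQLdb.connect(db="mydb")  # hypothetical connection settings
cursor = connection.cursor()

# Insert new rows and overwrite changed ones (requires a PRIMARY or UNIQUE key on id).
cursor.execute("""
    INSERT INTO OldTable (id, field1, field2)
    SELECT id, field1, field2 FROM NewTable
    ON DUPLICATE KEY UPDATE field1 = VALUES(field1), field2 = VALUES(field2)
""")

# Remove rows that no longer exist in NewTable.
cursor.execute("""
    DELETE o FROM OldTable o
    LEFT JOIN NewTable n ON n.id = o.id
    WHERE n.id IS NULL
""")

connection.commit()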

Why doesn't PostgreSQL start returning rows immediately?

The following query returns data right away:
SELECT time, value from data order by time limit 100;
Without the limit clause, it takes a long time before the server starts returning rows:
SELECT time, value from data order by time;
I observe this both by using the query tool (psql) and when querying using an API.
Questions/issues:
The amount of work the server has to do before starting to return rows should be the same for both select statements. Correct?
If so, why is there a delay in case 2?
Is there some fundamental RDBMS issue that I do not understand?
Is there a way I can make postgresql start returning result rows to the client without pause, also for case 2?
EDIT (see below). It looks like setFetchSize is the key to solving this. In my case I execute the query from python, using SQLAlchemy. How can I set that option for a single query (executed by session.execute)? I use the psycopg2 driver.
The column time is the primary key, BTW.
EDIT:
I believe this excerpt from the JDBC driver documentation describes the problem and hints at a solution (I still need help - see the last bullet list item above):
By default the driver collects all the results for the query at once. This can be inconvenient for large data sets so the JDBC driver provides a means of basing a ResultSet on a database cursor and only fetching a small number of rows.
and
Changing code to cursor mode is as simple as setting the fetch size of the Statement to the appropriate size. Setting the fetch size back to 0 will cause all rows to be cached (the default behaviour).
// make sure autocommit is off
conn.setAutoCommit(false);
Statement st = conn.createStatement();
// Turn use of the cursor on.
st.setFetchSize(50);
The psycopg2 DB-API driver buffers the whole query result before returning any rows. You'll need to use a server-side cursor to fetch results incrementally. For SQLAlchemy, see server_side_cursors in the docs; if you're using the ORM, see the Query.yield_per() method.
SQLAlchemy currently doesn't have an option to set that per single query, but there is a ticket with a patch for implementing that.
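For reference, with the raw psycopg2 driver a server-side cursor is simply a named cursor; rows are then fetched in batches while you iterate instead of being buffered all at once. A minimal sketch with hypothetical connection details:

import psycopg2

conn = psycopg2.connect("dbname=mydb")   # hypothetical connection string
cur = conn.cursor(name="stream_data")    # naming the cursor makes it server-side
cur.itersize = 2000                      # rows fetched per round trip while iterating

cur.execute("SELECT time, value FROM data ORDER BY time")
for row in cur:                          # rows start arriving without buffering the whole result
    pass                                 # process each row here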
In theory, because your ORDER BY is by primary key, a sort of the results should not be necessary, and the DB could indeed return data right away in key order.
I would expect a capable DB to notice this and optimize for it. It seems that PGSQL does not. *shrug*
You don't notice any impact if you have LIMIT 100 because it's very quick to pull those 100 results out of the DB, and you won't notice any delay if they're first gathered up and sorted before being shipped out to your client.
I suggest trying to drop the ORDER BY. Chances are, your results will be correctly ordered by time anyway (there may even be a standard or specification that mandates this, given your PK), and you might get your results more quickly.
