How can I safely parameterize table/column names in BigQuery SQL? - python

I am using Python's BigQuery client to create and keep up to date some tables in BigQuery that contain daily counts of certain Firebase events joined with data from other sources (sometimes grouped by country etc.). Keeping them up to date requires deleting and replacing data for past days, because the day tables for Firebase events can change after they are created (see here and here). I keep them up to date this way to avoid querying the entire dataset, which would be very expensive both financially and computationally.
This deletion-and-replacement process needs to be repeated for many tables, so I need to reuse some queries stored in text files. For example, one deletes everything in the table from a particular date onward (delete from x where event_date >= y). But because BigQuery disallows the parameterization of table names (see here), I have to duplicate these query text files for each table I apply them to. If I want to run tests, I would also have to duplicate them for the test tables.
I basically need something like psycopg2.sql for BigQuery, so that I can safely parameterize table and column names while avoiding SQLi. I actually tried to repurpose that module by calling its as_string() method and using the result to query BigQuery, but the resulting syntax doesn't match BigQuery's, and I would need to open a Postgres connection to do it (as_string() expects a cursor/connection object). I also tried something similar with sqlalchemy.text, to no avail. So I concluded I would have to implement some way of parameterizing the table name myself, or find a workaround using the Python client library. Any ideas on how to do this safely, without opening the door to SQLi? I can't go into detail, but unfortunately I cannot store the tables in Postgres or any other database.

As discussed in the comments, the best option for avoiding SQLi in your case is ensuring your server's security.
If you nevertheless need or want to validate an input parameter before building your query, I recommend checking the input strings against a regular expression.
In Python you can use the re library for this.
Since I don't know how your code works, how your datasets/tables are organized, or exactly how you plan to check whether a string is a valid source, I created the basic example below, which shows how you could check a string using this library:
import re

tests = [
    "your-dataset.your-table",
    "(SELECT * FROM <another table>)",
    "dataset-09123.my-table-21112019",
]

# Supposing that the input pattern is <dataset>.<table>
regex = re.compile(r"[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+")

for t in tests:
    if regex.fullmatch(t):
        print("This source is ok")
    else:
        print("This source is not ok")
In this example, only strings that match the pattern dataset.table (where both the dataset and the table may contain only alphanumeric characters and dashes) are considered valid.
When running the code, the first and third elements of the list are accepted, while the second (which could potentially change your whole query) is rejected.
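Once a table name has passed validation, any values in the query should still go through real query parameters. Below is a minimal sketch, assuming the google-cloud-bigquery client; the function name, the cutoff parameter, and the slightly widened regex (to allow underscores) are my own additions, not part of the answer above:

import datetime
import re

from google.cloud import bigquery

# The pattern from the example above, widened to allow underscores
TABLE_RE = re.compile(r"[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+")

def delete_from(table: str, cutoff: datetime.date) -> None:
    """Delete all rows on or after `cutoff` from a validated dataset.table."""
    if not TABLE_RE.fullmatch(table):
        raise ValueError(f"invalid table name: {table!r}")

    client = bigquery.Client()
    # The identifier is interpolated only after validation; the value is a
    # real query parameter, so it cannot alter the statement.
    query = f"DELETE FROM `{table}` WHERE event_date >= @cutoff"
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("cutoff", "DATE", cutoff)]
    )
    client.query(query, job_config=job_config).result()

# Hypothetical usage:
# delete_from("my-dataset.daily_event_counts", datetime.date(2019, 11, 1))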

Related

Node-level documentation for sqlglot

Using the Python library sqlglot, where can I find documentation that explains:
Which attributes I should expect to find on which expression node types (which arg types do Join, Table, Select, etc. have?)
What overall structure I should expect the AST to have for various kinds of SQL statements? (e.g. that a Select has a "joins" child, which in turn has a list of tables) And what "arg" name do I use to access each of these?
For example, what documentation could I look at to know that code like the snippet below (from here) will find the names of the tables within the joins? How would I know to request "joins" from node.args? What does "this" refer to?
import sqlglot
from sqlglot import exp

node = sqlglot.parse_one(sql)
for join in node.args["joins"]:
    table = join.find(exp.Table).text("this")
My use case is to parse a bunch of Hive SQL scripts in order to find FROM, INSERT, ADD/DROP TABLE statements/clauses within the scripts, for analyzing which statements interact with which tables. So I am using sqlglot as a general-purpose SQL parser/AST, rather than as a SQL translator.
I have generated a copy of the pdocs locally, but it only tells me which Python API methods are available on the Expression nodes. It does not seem to answer the questions above, unless I am looking in the wrong place.
You can look in the expressions.py file:
https://github.com/tobymao/sqlglot/blob/main/sqlglot/expressions.py
Every expression type there defines an arg_types dictionary.
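For example, here is a small sketch (using only public sqlglot APIs; the exact contents of each arg_types dict vary by version) showing how you can explore the node structure interactively:

import sqlglot
from sqlglot import exp

# Each Expression subclass declares its accepted args in arg_types;
# True marks a required arg, False an optional one.
print(exp.Join.arg_types)
print(exp.Table.arg_types)

# repr() of a parsed tree shows the nesting and arg names directly,
# which is often the quickest way to learn the AST shape.
tree = sqlglot.parse_one("SELECT a FROM t1 JOIN t2 ON t1.id = t2.id")
print(repr(tree))

# find_all() walks the tree for you; "this" is a node's main child
# (for a Table node, its name, which the .name property exposes).
for table in tree.find_all(exp.Table):
    print(table.name)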

Display data as insert statements

I need to move data from one database to another.
I can use Python; my counterpart can't.
How can I select all data from a table and save it as INSERT statements?
I'm using SQLAlchemy.
Is there a way to create a backup like this?
As others have suggested in comments, using the database backup program (mysqldump, pg_dump, etc) is your best bet; that will make sure that the data is transferred correctly for the underlying database.
Outputting INSERT statements will be risky; even the built-in SQLAlchemy facility for doing this comes with a big red warning, complete with a picture of a dragon, indicating that it can be dangerous.
If you nevertheless need to do this, and the data is generally trusted and doesn't contain much in the way of odd types, you can:
1. Create (but do not execute) an insert expression as though you were inserting the rows back into the database.
2. Call the .compile() method with the relevant dialect parameter and literal_binds set to True.
3. Manually double-check that the output is, in fact, valid for the database; as per the warning in the SQLAlchemy FAQ, this method is not very dependable and may expose you to attacks if it's part of any production system.
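A minimal sketch of those steps, assuming SQLAlchemy 1.4+, a hypothetical source table named my_table, and PostgreSQL as the target dialect:

from sqlalchemy import MetaData, Table, create_engine, insert, select
from sqlalchemy.dialects import postgresql

engine = create_engine("sqlite:///source.db")  # hypothetical source database
metadata = MetaData()
table = Table("my_table", metadata, autoload_with=engine)  # reflect the table

with engine.connect() as conn:
    rows = conn.execute(select(table)).mappings().all()

for row in rows:
    # Step 1: build the insert; step 2: compile it with literal_binds
    stmt = insert(table).values(**row)
    compiled = stmt.compile(
        dialect=postgresql.dialect(),
        compile_kwargs={"literal_binds": True},
    )
    print(str(compiled) + ";")  # step 3: eyeball the output before using it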
I wouldn't recommend formatting up INSERT statements by hand; you're unlikely to do a better job than SQLAlchemy...

SQL to handle table updates in a "dynamically typed" fashion

I'm playing around with Python 3's sqlite3 module, and acquainting myself with SQL in the process.
I've written a toy program to hash a salted password and store it, the associated username, and the salt into a database. I thought it would be intuitive to create a function of the signature:
def store(table, data, database=':memory:')
Callable as, for example, store('logins', {'username': 'bob', 'salt': 'foo', 'salted_hash': 'bar'}), so as to add into logins, in a new row, the value bob for username, foo for salt, et cetera.
Unfortunately I'm swamped with what SQL to write. I'm trying to do this in a "dynamically typed" fashion, in that I won't be punished for storing the wrong types, and will be able to add new columns at will, for example.
I want the function to, sanitizing all input:
Check if the table exists, and create it if it doesn't, with the passed keys from the dictionary as the columns;
If the table already exists, check whether it has the specified columns (the keys of the passed dictionary), and add any that are missing (is this even possible with SQL?);
Add the individual values from my dictionary to the appropriate columns in the table.
I can use INSERT for the latter, but it seems very rigid. What happens if the columns don't exist, for example? How could we then add them?
I don't mind whether the code is tailored to Python 3's sqlite3, or just the SQL as an outline; as long as I can work it and use it to some extent (and learn from it) I'm very grateful.
(On a different note, I'm wondering what other approaches I could use instead of an SQL relational database; I've used Amazon's SimpleDB before and have considered using it for this purpose, as it was very "dynamically typed", but I want to know what SQL code I'd have to use here.)
SQLite3 is dynamically typed, so no problem there.
CREATE TABLE IF NOT EXISTS <name> ... See here.
You can see if the columns you need already exist in the table by using sqlite_master documented in this FAQ. You'll need to parse the sql column, but since it's exactly what your program provided to create the table, you should know the syntax.
If the column does not exist, you can ALTER TABLE <name> ADD COLUMN ... See here.
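Putting those pieces together, here is a minimal sketch of the store() function from the question. Note two deviations, both my own: it uses PRAGMA table_info to discover existing columns rather than parsing sqlite_master.sql, and it validates identifiers with str.isidentifier() before interpolating them, since table and column names cannot be bound as ? parameters:

import sqlite3

def store(table, data, database=':memory:'):
    # Identifiers can't be bound as ? parameters, so validate them before
    # interpolating; actual values always go through placeholders.
    if not table.isidentifier() or not all(k.isidentifier() for k in data):
        raise ValueError("invalid table or column name")

    conn = sqlite3.connect(database)

    # 1. Create the table if needed, with the dict keys as columns
    #    (SQLite happily accepts columns without declared types).
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(data)})")

    # 2. Add any columns that don't exist yet.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    for col in data.keys() - existing:
        conn.execute(f"ALTER TABLE {table} ADD COLUMN {col}")

    # 3. Insert the values through placeholders.
    placeholders = ', '.join('?' for _ in data)
    conn.execute(
        f"INSERT INTO {table} ({', '.join(data)}) VALUES ({placeholders})",
        list(data.values()),
    )
    conn.commit()
    return conn

# The call from the question:
conn = store('logins', {'username': 'bob', 'salt': 'foo', 'salted_hash': 'bar'})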

Reverse Search Best Practices?

I'm making an app that has a need for reverse searches. By this, I mean that users of the app will enter search parameters and save them; then, when any new objects get entered onto the system, if they match the existing search parameters that a user has saved, a notification will be sent, etc.
I am having a hard time finding solutions for this type of problem.
I am using Django and thinking of building the searches and pickling them using Q objects as outlined here: http://www.djangozen.com/blog/the-power-of-q
The way I see it, when a new object is entered into the database, I will have to load every single saved query from the db and somehow run it against this one new object to see if it would match that search query... This doesn't seem ideal - has anyone tackled such a problem before?
At the database level, many databases offer 'triggers'.
Another approach is to have timed jobs that periodically fetch all items from the database whose last-modified date is later than the last run; these then get filtered and alerts issued. You can perhaps push some of the filtering into the query itself. However, this is a bit trickier if notifications need to be sent when items are deleted.
You can also put triggers manually into the code that submits data to the database, which is perhaps more flexible and certainly doesn't rely on specific features of the database.
A nice way for the triggers and the alerts to communicate is through message queues - queues such as RabbitMQ and other AMQP implementations will scale with your site.
The amount of effort you use to solve this problem is directly related to the number of stored queries you are dealing with.
Over 20 years ago we handled stored queries by treating them as minidocs and indexing them based on all of the must-have and may-have terms. A new doc's term list was used as a sort of query against this "database of queries" and that built a list of possibly interesting searches to run, and then only those searches were run against the new docs. This may sound convoluted, but when there are more than a few stored queries (say anywhere from 10,000 to 1,000,000 or more) and you have a complex query language that supports a hybrid of Boolean and similarity-based searching, it substantially reduced the number we had to execute as full-on queries -- often no more than 10 or 15 queries.
One thing that helped was that we were in control of the horizontal and the vertical of the whole thing. We used our query parser to build a parse tree and that was used to build the list of must/may have terms we indexed the query under. We warned the customer away from using certain types of wildcards in the stored queries because it could cause an explosion in the number of queries selected.
Update for comment:
Short answer: I don't know for sure.
Longer answer: We were dealing with a custom-built text search engine, and part of its query syntax allowed slicing the doc collection in certain ways very efficiently, with special emphasis on date_added. We played a lot of games because we were ingesting 4 to 10 million new docs a day and running them against up to 1,000,000+ stored queries on DEC Alphas with 64MB of main memory. (This was in the late '80s/early '90s.)
I'm guessing that filtering on something equivalent to date_added could be used in combination with the date of the last time you ran your queries, or maybe the highest id at the last query run time. If you need to re-run the queries against a modified record, you could use its id as part of the query.
For me to get any more specific, you're going to have to get a lot more specific about exactly what problem you are trying to solve and the scale of the solution you are trying to accomplish.
If you stored the type(s) of object(s) involved in each stored search as a generic relation, you could add a post-save signal to all involved objects. When the signal fires, it looks up only the searches that involve its object type and runs those. That probably will still run into scaling issues if you have a ton of writes to the db and a lot of saved searches, but it would be a straightforward Django approach.
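A hypothetical sketch of that approach (the model and helper names here, SavedSearch, load_q, and the notification hook, are illustrative rather than from the post above):

import pickle

from django.conf import settings
from django.contrib.contenttypes.models import ContentType
from django.db import models
from django.db.models.signals import post_save
from django.dispatch import receiver

class SavedSearch(models.Model):
    user = models.ForeignKey(settings.AUTH_USER_MODEL, on_delete=models.CASCADE)
    # Which model this search applies to (the type half of a generic relation)
    content_type = models.ForeignKey(ContentType, on_delete=models.CASCADE)
    pickled_q = models.BinaryField()  # the pickled Q object

    def load_q(self):
        return pickle.loads(self.pickled_q)

@receiver(post_save)
def run_matching_searches(sender, instance, created, **kwargs):
    if sender is SavedSearch:  # don't recurse on our own saves
        return
    ct = ContentType.objects.get_for_model(sender)
    # Load only the searches registered for this object's type...
    for search in SavedSearch.objects.filter(content_type=ct):
        # ...and re-run each one restricted to just the new object.
        if sender.objects.filter(search.load_q(), pk=instance.pk).exists():
            pass  # send the notification here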

Dynamic column formatting in SQL - and a backend to store the formatting

I'm trying to create a system in Python in which one can select a number of rows from a set of tables, which are to be formatted in a user-defined way. Let's say the table a has a set of columns, some of which include a date or timestamp value. The user-defined format for each column should be stored in another table, and queried and applied on the main query at runtime.
Let me give you an example: There are different ways of formatting a date column, e.g. using
SELECT to_char(column, 'YYYY-MM-DD') FROM table;
in PostgreSQL.
For example, I'd like the second parameter of the to_char() builtin to be queried dynamically from another table at runtime, and then applied if it has a value.
Reading the format definition from a table is not really the problem; the harder part is designing a database schema that can store the formatting choices a user makes through the interface. The user should be able to pick which columns to include in the query, as well as a user-defined format for each of those columns.
I've been thinking about how to do this in an elegant and efficient way for some days now, but to no avail. Having the user type the desired definition into a text field and including it in a query would be an open invitation for SQL injection attacks (although I could use escape() functions), and storing every possible combination doesn't seem feasible to me either.
It seems to me a stored procedure or a sub-select would work well here, though I haven't tested it. Let's say you store a date_format for each user in the users table.
SELECT to_char(column, (SELECT date_format FROM users WHERE users.id=123)) FROM table;
Your mileage may vary.
Pull the dates out as Unix timestamps and format them in Python:
SELECT DATE_PART('epoch', my_col::timestamp) FROM my_table;
my_date = datetime.datetime.fromtimestamp(row[0]) # Or equivalent for your toolkit
I've found a couple of advantages to this approach: Unix timestamps are a compact, effectively language-neutral common format, and the language you're querying the database from is richer than the underlying database, giving you plenty of options if you start wanting friendlier formatting like "today", "yesterday", "last week", "June 23rd".
I don't know what sort of application you're developing but if it's something like a web app which will be used by multiple people I'd also consider storing your database values in UTC so you can apply user-specific timezone settings when formatting without having to consider them for all of your database operations.
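A small sketch of that combination (the format string and timezone offset are made-up per-user preferences):

import datetime

# Pretend these came from a per-user settings table
user_format = "%B %d, %Y at %H:%M"
user_tz = datetime.timezone(datetime.timedelta(hours=-5))

epoch = 1382400000.0  # the kind of value DATE_PART('epoch', ...) returns

# Interpret the stored value as UTC, then render it in the user's zone
utc_dt = datetime.datetime.fromtimestamp(epoch, tz=datetime.timezone.utc)
print(utc_dt.astimezone(user_tz).strftime(user_format))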
