Node-level documentation for sqlglot - python

Using the Python library sqlglot, where can I find documentation that explains:
Which attributes should I expect to find on which expression node types (which arg types do Join, Table, Select, etc. have)?
What overall structure should I expect the AST to have for various kinds of SQL statements? (E.g. that a Select has a "joins" child, which in turn has a list of tables.) And what "arg" name do I use to access each of these?
For example, what documentation could I look at to know that code like the below (from here) will find the names of the tables within the joins? How would I know to request "joins" from node.args? What does "this" refer to?
import sqlglot
from sqlglot import exp
node = sqlglot.parse_one(sql)
for join in node.args["joins"]:
    table = join.find(exp.Table).text("this")
My use case is to parse a bunch of Hive SQL scripts in order to find FROM, INSERT, ADD/DROP TABLE statements/clauses within the scripts, for analyzing which statements interact with which tables. So I am using sqlglot as a general-purpose SQL parser/AST, rather than as a SQL translator.
I have generated a copy of the pdocs locally, but it only tells me which Python API methods are available on the Expression nodes. It does not seem to answer the questions above, unless I am looking in the wrong place.

You can look in the expressions.py file
https://github.com/tobymao/sqlglot/blob/main/sqlglot/expressions.py
Every expression type defines an arg_types dict listing the arg names it accepts.
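For example, you can inspect arg_types interactively, or print the repr of a parsed tree to see which arg names each node actually carries. A small sketch (the exact keys vary by sqlglot version):
import sqlglot
from sqlglot import exp
# Each Expression subclass declares the arg names it accepts.
print(exp.Join.arg_types)    # includes keys such as "this" and "on"
print(exp.Select.arg_types)
# The repr of a parsed tree shows the nested nodes and the arg names under which they live.
tree = sqlglot.parse_one("SELECT a FROM t1 JOIN t2 ON t1.id = t2.id")
print(repr(tree))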

Related

How to use multiple tables in python sqlitedict in the same file?

In the sqlitedict README on GitHub, it says that one of the features is:
Support for multiple tables (=dicts) living in the same database file.
https://github.com/RaRe-Technologies/sqlitedict#features
How can I do that? I have researched this, but everything I found was a workaround, e.g. nested dictionaries.
The source code shows that the class takes tablename as an argument, so the answer is to use that argument. Kind of simple, but it wasn't spelled out in the docs.
https://github.com/RaRe-Technologies/sqlitedict/blob/master/sqlitedict.py#L108
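A minimal sketch of how that argument is used (the file and table names here are just examples):
from sqlitedict import SqliteDict
# Two logical dicts (tables) living in the same SQLite file.
users = SqliteDict('example.sqlite', tablename='users', autocommit=True)
orders = SqliteDict('example.sqlite', tablename='orders', autocommit=True)
users['alice'] = {'email': 'alice@example.com'}
orders['order-1'] = {'user': 'alice', 'total': 9.99}
users.close()
orders.close()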

How can I safely parameterize table/column names in BigQuery SQL?

I am using python's BigQuery client to create and keep up-to-date some tables in BigQuery that contain daily counts of certain firebase events joined with data from other sources (sometimes grouped by country etc.). Keeping them up-to-date requires the deletion and replacement of data for past days because the day tables for firebase events can be changed after they are created (see here and here). I keep them up-to-date in this way to avoid querying the entire dataset which is very financially/computationally expensive.
This deletion and replacement process needs to be repeated for many tables and so consequently I need to reuse some queries stored in text files. For example, one deletes everything in the table from a particular date onward (delete from x where event_date >= y). But because BigQuery disallows the parameterization of table names (see here) I have to duplicate these query text files for each table I need to do this for. If I want to run tests I would also have to duplicate the aforementioned queries for test tables too.
I basically need something like psycopg2.sql for bigquery so that I can safely parameterize table and column names whilst avoiding SQLi. I actually tried to repurpose this module by calling the as_string() method and using the result to query BigQuery. But the resulting syntax doesn't match and I need to start a postgres connection to do it (as_string() expects a cursor/connection object). I also tried something similar with sqlalchemy.text to no avail. So I concluded I'd have to basically implement some way of parameterizing the table name myself, or implement some workaround using the python client library. Any ideas of how I should go about doing this in a safe way that won't lead to SQLi? Cannot go into detail but unfortunately I cannot store the tables in postgres or any other db.
As discussed in the comments, the best option for avoiding SQLi in your case is ensuring your server's security.
If you nevertheless need or want to validate your input parameters before building your query, I recommend using a regex to check the input strings.
In Python you could use the re library.
As I don't know how your code works, how your datasets/tables are organized, or exactly how you are planning to check whether the string is a valid source, I created the basic example below, which shows how you could check a string using this library:
import re
tests = ["your-dataset.your-table", "(SELECT * FROM <another table>)", "dataset-09123.my-table-21112019"]
# Supposing that the input pattern is <dataset>.<table>
regex = re.compile(r"[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+")
for t in tests:
    if regex.fullmatch(t):
        print("This source is ok")
    else:
        print("This source is not ok")
In this example, only strings that match the pattern dataset.table (where both the dataset and the table must contain only alphanumeric characters and dashes) will be considered valid.
When running the code, the first and third elements of the list will be considered valid, while the second (which could potentially change your whole query) will be considered invalid.
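If you take that route, one way to combine the validated identifier with BigQuery's ordinary value parameters is sketched below. This assumes the google-cloud-bigquery client; the helper function, table pattern, and column names are illustrative, not part of any library API:
import re
from google.cloud import bigquery
TABLE_RE = re.compile(r"[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+")
def delete_from_date(client, table, cutoff_date):
    # Refuse anything that is not a plain <dataset>.<table> identifier
    # before interpolating it into the statement.
    if not TABLE_RE.fullmatch(table):
        raise ValueError(f"Refusing to query suspicious table name: {table!r}")
    sql = f"DELETE FROM `{table}` WHERE event_date >= @cutoff"
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("cutoff", "DATE", cutoff_date)]
    )
    # The date itself is passed as a bound query parameter, not interpolated.
    return client.query(sql, job_config=job_config).result()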

Python sqlite3 user defined queries (selecting tables)

I have a uni assignment where I'm implementing a database that users interact with over a webpage. The goal is to search for books given some criteria. This is one module within a bigger project.
I'd like to let users be able to select the criteria and order they want, but the following doesn't seem to work:
cursor.execute("SELECT * FROM Books WHERE ? REGEXP ? ORDER BY ? ?", [category, criteria, order, asc_desc])
I can't work out why, because when I go
cursor.execute("SELECT * FROM Books WHERE title REGEXP ? ORDER BY price ASC", [criteria])
I get full results. Is there any way to fix this without resorting to building the query with string formatting (and risking injection)?
The data is organised in a table where the book's ISBN is a primary key, and each row has many columns, such as the book's title, author, publisher, etc. The user should be allowed to select any of these columns and perform a search.
Generally, SQL engines only support parameters on values, not on the names of tables, columns, etc. And this is true of sqlite itself, and Python's sqlite module.
The rationale behind this is partly historical (traditional clumsy database APIs had explicit bind calls where you had to say which column number you were binding with which value of which type, etc.), but mainly because there isn't much good reason to parameterize table or column names.
On the one hand, you don't need to worry about quoting or type conversion for table and column names the way you do for values. On the other hand, once you start letting end-user-sourced text choose a table or column, it's hard to limit the harm they could do, with or without parameter binding.
Also, from a performance point of view (and if you read the sqlite docs—see section 3.0—you'll notice they focus on parameter binding as a performance issue, not a safety issue), the database engine can reuse a prepared optimized query plan when given different values, but not when given different columns.
So, what can you do about this?
Well, generating SQL strings dynamically is one option, but not the only one.
First, this kind of thing is often a sign of a broken data model that needs to be normalized one step further. Maybe you should have a BookMetadata table, where you have many rows—each with a field name and a value—for each Book?
Second, if you want something that's conceptually normalized as far as this code is concerned, but actually denormalized (either for efficiency, or because some other code needs it denormalized)… functions are great for that. Register a wrapper with create_function, and you can pass parameters to that function when you execute your query.
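If you do take the dynamic-SQL route, the usual safeguard is to whitelist the user-chosen identifiers against the table's real schema and keep binding the values. A minimal sketch, assuming a REGEXP function has already been registered on the connection via create_function:
import sqlite3
def search_books(conn, category, criteria, order, asc_desc):
    # Only accept column names that actually exist in the Books table.
    allowed = {row[1] for row in conn.execute("PRAGMA table_info(Books)")}
    if category not in allowed or order not in allowed:
        raise ValueError("unknown column")
    direction = "ASC" if asc_desc.upper() == "ASC" else "DESC"
    # Identifiers are interpolated only after validation; the search value stays a bound parameter.
    sql = f"SELECT * FROM Books WHERE {category} REGEXP ? ORDER BY {order} {direction}"
    return conn.execute(sql, [criteria]).fetchall()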

Parsing an xml file and storing it into a database

Is there a generic/automatic way in R or in Python to parse XML files with their nodes and attributes, automatically generate MySQL tables for storing that information, and then populate those tables?
Regarding "Is there a generic/automatic way in R to parse xml files with their nodes and attributes, automatically generate mysql tables for storing that information and then populate those tables?", the answer is a good old yes, you can, at least in R.
The XML package for R can read XML documents and return R data.frame types in a single call using the xmlToDataFrame() function.
And the RMySQL package can transfer data.frame objects to the database in a single command---including table creation if need be---using the dbWriteTable() function defined in the common DBI backend for R and provided for MySQL by RMySQL.
So in short: two lines can do it, so you can easily write yourself a new helper function that does it along with a commensurate amount of error checking.
They're three separate operations: parsing, table creation, and data population. You can do all three with python, but there's nothing "automatic" about it. I don't think it's so easy.
For example, XML is hierarchical and SQL is relational, set-based. I don't think it's always so easy to get a good relational schema for every single XML stream you can encounter.
There's the XML package for reading XML into R, and the RMySQL package for writing data from R into MySQL.
Between the two there's a lot of work. XML surpasses the scope of an RDBMS like MySQL, so something that could handle any XML thrown at it would be either ridiculously complex or trivially useless.
We do something like this at work sometimes but not in python. In that case, each usage requires a custom program to be written. We only have a SAX parser available. Using an XML decoder to get a dictionary/hash in a single step would help a lot.
At the very least you'd have to tell it which tags map to which tables and fields; no pre-existing lib can know that...
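On the Python side, a minimal sketch of the non-automatic approach, using only the standard library (sqlite3 stands in for MySQL here, and the XML layout and table are assumptions for illustration):
import sqlite3
import xml.etree.ElementTree as ET
xml_doc = '<books><book title="Dune" author="Herbert"/><book title="Emma" author="Austen"/></books>'
root = ET.fromstring(xml_doc)
rows = [(b.get("title"), b.get("author")) for b in root.findall("book")]
conn = sqlite3.connect(":memory:")  # swap in a MySQL connection in practice
conn.execute("CREATE TABLE IF NOT EXISTS book (title TEXT, author TEXT)")
conn.executemany("INSERT INTO book (title, author) VALUES (?, ?)", rows)
conn.commit()
print(conn.execute("SELECT * FROM book").fetchall())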

SQL to handle table updates in a "dynamically typed" fashion

I'm playing around with Python 3's sqlite3 module, and acquainting myself with SQL in the process.
I've written a toy program to hash a salted password and store it, the associated username, and the salt into a database. I thought it would be intuitive to create a function of the signature:
def store(table, data, database=':memory:')
Callable as, for example, store('logins', {'username': 'bob', 'salt': 'foo', 'salted_hash': 'bar'}), and able to add into logins, in a new row, the value bob for username, foo for salt, et cetera.
Unfortunately I'm stuck on what SQL to write. I'm trying to do this in a "dynamically typed" fashion, in that I won't be punished for storing the wrong types and can, for example, add new columns at will.
I want the function to, sanitizing all input:
Check if the table exists, and create it if it doesn't, with the passed keys from the dictionary as the columns;
If the table already exists, check whether it has the specified columns (the keys of the passed dictionary), and add them if it doesn't (is this even possible with SQL?);
Add the individual values from my dictionary to the appropriate columns of the table.
I can use INSERT for the latter, but it seems very rigid. What happens if the columns don't exist, for example? How could we then add them?
I don't mind whether the code is tailored to Python 3's sqlite3, or just the SQL as an outline; as long as I can work it and use it to some extent (and learn from it) I'm very grateful.
(On a different note, I'm wondering what other approaches I could use instead of a SQL'd relational database; I've used Amazon's SimpleDB before and have considered using that for this purpose as it was very "dynamically typed", but I want to know what SQL code I'd have to use for this purpose.)
SQLite3 is dynamically typed, so no problem there.
CREATE TABLE IF NOT EXISTS <name> ... See here.
You can see if the columns you need already exist in the table by using sqlite_master documented in this FAQ. You'll need to parse the sql column, but since it's exactly what your program provided to create the table, you should know the syntax.
If the column does not exist, you can ALTER TABLE <name> ADD COLUMN ... See here.
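Putting those three pieces together, here is a minimal sketch of the store function (it takes a connection instead of a database path so repeated calls share state, assumes the table and column names come from your own code rather than from end users, and uses PRAGMA table_info instead of parsing sqlite_master):
import sqlite3
def store(table, data, conn):
    cols = list(data)
    # Create the table if it doesn't exist, with the dict keys as columns.
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} ({', '.join(cols)})")
    # Add any columns that are missing from the existing table.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    for col in cols:
        if col not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {col}")
    # Insert the row, binding the values as parameters.
    placeholders = ", ".join("?" for _ in cols)
    conn.execute(f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders})",
                 [data[c] for c in cols])
    conn.commit()
conn = sqlite3.connect(":memory:")
store("logins", {"username": "bob", "salt": "foo", "salted_hash": "bar"}, conn)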
