I have a uni assignment where I'm implementing a database that users interact with over a webpage. The goal is to search for books given some criteria. This is one module within a bigger project.
I'd like to let users select the search column and the sort order they want, but the following doesn't seem to work:
cursor.execute("SELECT * FROM Books WHERE ? REGEXP ? ORDER BY ? ?", [category, criteria, order, asc_desc])
I can't work out why, because when I go
cursor.execute("SELECT * FROM Books WHERE title REGEXP ? ORDER BY price ASC", [criteria])
I get full results. Is there any way to fix this without resorting to string formatting that opens me up to SQL injection?
The data is organised in a table where the book's ISBN is a primary key, and each row has many columns, such as the book's title, author, publisher, etc. The user should be allowed to select any of these columns and perform a search.
Generally, SQL engines only support parameters for values, not for the names of tables, columns, and the like. This is true of sqlite itself, and of Python's sqlite3 module.
The rationale behind this is partly historical (traditional clumsy database APIs had explicit bind calls where you had to say which column number you were binding with which value of which type, etc.), but mainly because there isn't much good reason to parameterize anything but values.
On the one hand, you don't need to worry about quoting or type conversion for table and column names the way you do for values. On the other hand, once you start letting end-user-sourced text specify a table or column, it's hard to limit the harm it could do.
Also, from a performance point of view (and if you read the sqlite docs—see section 3.0—you'll notice they focus on parameter binding as a performance issue, not a safety issue), the database engine can reuse a prepared optimized query plan when given different values, but not when given different columns.
So, what can you do about this?
Well, generating SQL strings dynamically is one option, but not the only one.
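If you do generate the SQL dynamically, the usual safe pattern is to validate the user's choice against a whitelist of known column names, so only trusted text is ever interpolated and the search value stays a bound parameter. A minimal sketch, with an invented search_books helper, assuming the Books columns from the question and the same user-defined REGEXP function that makes your second query work:

ALLOWED_COLUMNS = {"title", "author", "publisher", "isbn"}

def search_books(cursor, category, criteria, order, descending=False):
    # Only whitelisted names are interpolated; the pattern is still bound.
    if category not in ALLOWED_COLUMNS or order not in ALLOWED_COLUMNS:
        raise ValueError("unknown column")
    direction = "DESC" if descending else "ASC"
    sql = (f"SELECT * FROM Books WHERE {category} REGEXP ? "
           f"ORDER BY {order} {direction}")
    return cursor.execute(sql, [criteria]).fetchall()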
First, this kind of thing is often a sign of a broken data model that needs to be normalized one step further. Maybe you should have a BookMetadata table, where you have many rows—each with a field name and a value—for each Book?
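A sketch of what that might look like, with invented column names, assuming an open sqlite3 connection conn and a registered REGEXP function as in your working query; once each searchable attribute is a row, the "column" the user picks becomes an ordinary bindable value:

conn.executescript("""
CREATE TABLE IF NOT EXISTS BookMetadata (
    isbn  TEXT REFERENCES Books(isbn),
    field TEXT,   -- e.g. 'title', 'author', 'publisher'
    value TEXT
);
""")
# Both the field name and the pattern are now plain value parameters.
rows = conn.execute(
    "SELECT isbn FROM BookMetadata WHERE field = ? AND value REGEXP ?",
    [category, criteria],
).fetchall()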
Second, if you want something that's conceptually normalized as far as this code is concerned, but actually denormalized (either for efficiency, or because some other code needs it that way)… functions are great for that. Register a wrapper with create_function, and you can pass parameters to that function when you execute the query.
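A hedged sketch of that idea with sqlite3's create_function; the pick_field helper and its column list are invented, and category/criteria hold the user's choices as in the question:

import re
import sqlite3

conn = sqlite3.connect("books.db")

def regexp(pattern, value):
    # sqlite routes "x REGEXP y" to this function once it's registered.
    return re.search(pattern, value or "") is not None

def pick_field(name, title, author, publisher):
    # Choose one of the passed-in columns by name at run time.
    return {"title": title, "author": author, "publisher": publisher}[name]

conn.create_function("regexp", 2, regexp)
conn.create_function("pick_field", 4, pick_field)

# The column choice is now an ordinary value parameter.
rows = conn.execute(
    "SELECT * FROM Books "
    "WHERE pick_field(?, title, author, publisher) REGEXP ?",
    [category, criteria],
).fetchall()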
I've got a large dataset to work with to create a storage system that monitors movement in a store. There are over 300 products in that store, and the main structure of all the tables is the same; the only difference is the data inside. There's a larger database called StorageTF, and I want to create a lot of tables called Product_1, Product_2, Product_3, etc.
The main large dataset (table) looks like this:
CREATE TABLE StoringTF (
    Store_code INTEGER,
    Store TEXT,
    Product_Date TEXT,
    Permission INTEGER,
    Product_Code INTEGER,
    Product_Name TEXT,
    Incoming INTEGER,
    Unit_Buying_Price INTEGER,
    Total_Buying_Price INTEGER,
    Outgoing INTEGER,
    Unit_Sell_Price INTEGER,
    Total_Sell_Price INTEGER,
    Description TEXT
)
I want the user to input a code in an Entry widget called PCode; it looks like this:
from tkinter import Tk, Entry
root = Tk()
PCode = Entry(root, width=40)
PCode.grid(row=0, column=0)
Then a function compares the input with all the codes in the main table, finds the matching one, and gets the table that has the same Product_Code.
So the sequence is: product tables are created for every Product_Code in the main table, each holding all the rows from the main table with that Product_Code. Then, when the program is opened, the user inputs a product code, and the program picks the table with the same code and shows it to the user.
Thanks a lot; I know it's hard, but I really need your help and I'm certain you can help me.
The product table should look like:
CREATE TABLE Product_x (
    Product_Code INTEGER,
    Product_Name TEXT, -- taken from main-table rows with the same product code
    Entry_Date TEXT,
    Permission_Number INTEGER,
    Incoming INTEGER,
    Outgoing INTEGER,
    Description TEXT,
    Total_Quantity_In_Store INTEGER, -- which is the main table's Incoming - Outgoing
    Total_Value_In_Store INTEGER -- the main table's Total_Buying_Price - Total_Sell_Price
)
Thank you for your help; I hope you can figure it out because I'm really struggling with it.
From your comment:
I think I'd select some columns from the main table, but I don't know how I'd update only some columns with selected columns from the main table where product code = PCode.get() (which is the entry box). Is that possible?
Yes, it is definitely possible to present only certain rows and columns of data to the user.
However, there are many patterns (i.e. programming techniques) that you could follow for presenting data to the user, but every common best-practice technique separates the backend data (i.e. the database) from the user interface. It is not necessary to limit presentation to one entire table at a time. In most cases the data should never be presented and/or exposed to the user exactly as it appears in a table. Of course, sometimes the data is simple and direct enough to do that, but most applications re-format and group data in different views for proper presentation. (Here the term view is meant as a very general, abstract term for representing data in alternative ways from how it is stored. I mention specific sqlite views below.)
The entire philosophy behind modern databases is efficient, well-designed storage that can be queried to return just the data that is appropriate for each application. Much of this capability is based on the host-language data models, but sqlite directly supports features to help with this. For instance, a view can be defined to select only certain columns and rows at a time (i.e. choose certain Product_Code values). A sqlite view is just an SQL query that is saved and can have certain properties and actions defined for it. By default, a sqlite view is read-only, but triggers can be defined to allow updates to the underlying tables via the view.
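For example, here is a hedged sketch of such a view over the StoringTF table from the question; the view name is invented, and the two calculated columns use the per-row formulas from your Product_x layout:

import sqlite3

conn = sqlite3.connect("storage.db")
conn.execute("""
    CREATE VIEW IF NOT EXISTS ProductView AS
    SELECT Product_Code,
           Product_Name,
           Product_Date,
           Permission,
           Incoming,
           Outgoing,
           Description,
           Incoming - Outgoing AS Total_Quantity_In_Store,
           Total_Buying_Price - Total_Sell_Price AS Total_Value_In_Store
    FROM StoringTF
""")

def show_product(code):
    # The code typed into the PCode entry becomes a plain bound
    # parameter; no Product_N tables are ever created.
    return conn.execute(
        "SELECT * FROM ProductView WHERE Product_Code = ?", [code]
    ).fetchall()

In the Tkinter handler you would call show_product(int(PCode.get())) and display the returned rows however you like.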
From my earlier comment: you should research data normalization. That is the key principle for designing relational databases. For instance, you should avoid duplicate data columns like Product_Name; that column should only be in StoringTF. Calculated columns are also usually redundant and unnecessary: don't store the Total_Value_In_Store column, rather calculate it when needed by a query and/or view. Having duplicate columns invites mismatched data, or at least requires unnecessary care to make sure all columns are synced when one is updated. Instead you can just query joined tables to get related values.
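For instance, a stock-on-hand figure can be computed per product at query time rather than stored; a sketch reusing the StoringTF columns and the conn connection from above:

totals = conn.execute("""
    SELECT Product_Code,
           SUM(Incoming) - SUM(Outgoing) AS Total_Quantity_In_Store
    FROM StoringTF
    GROUP BY Product_Code
""").fetchall()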
Honestly, these concepts can require much study to implement properly. By all means, go forward with developing a solution that fits your needs, but a Stack Overflow answer is no place for the full tutorial I suspect you might need. Really, your question seems more about overall design, and I think my answer can get you started on the right track. Anything more specific and you'll need to ask other questions later on.
I'm not really sure of the best way to go about this, or if I'm just asking for a life that's easier than it should be. I have a backend for a web application and I like to write all of the queries in raw SQL. For instance, to get a specific user profile, or a number of users, I have a query like this:
SELECT accounts.id,
       accounts.username,
       accounts.is_brony
FROM accounts
WHERE accounts.id IN %(ids)s;
This is really nice because I can get one user profile, or many user profiles with the same query. Now my real query is actually almost 50 lines long. It has a lot of joins and other conditions for this profile.
Let's say I want to get all of the same information for a user profile, but instead of looking up a specific user ID I want a single random user. I don't think it makes sense to copy and paste 50 lines of code just to modify two lines at the end.
SELECT accounts.id,
       accounts.username,
       accounts.is_brony
FROM accounts
ORDER BY random()
LIMIT 1;
Is there some way to use some sort of inheritance in building queries, so that at the end I can modify a couple of conditions while keeping the core similarities the same?
I'm sure I could manage it by concatenating strings and such, but I was curious if there's a more widely accepted method for approaching such a situation. Google has failed me.
The canonical answer is to create a view and use that with different WHERE and ORDER BY clauses in queries.
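A minimal sketch of that, assuming psycopg2-style parameters as in the question; the view name is invented:

cursor.execute("""
    CREATE OR REPLACE VIEW account_profiles AS
    SELECT accounts.id,
           accounts.username,
           accounts.is_brony
           -- ...the other ~50 lines of joins and columns live here once...
    FROM accounts
""")

# The shared definition, with a different tail per use case:
cursor.execute("SELECT * FROM account_profiles WHERE id IN %(ids)s",
               {"ids": (1, 2, 3)})
cursor.execute("SELECT * FROM account_profiles ORDER BY random() LIMIT 1")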
But, depending on your query and your tables, that might not be a good solution for your special case.
A query that is blazingly fast with WHERE accounts.id IN (1, 2, 3) might perform abysmally with ORDER BY random() LIMIT 1. In that case you'll have to come up with a different query for the second requirement.
Let's say there is a table of People, and let's say there are 1000+ of them in the system. Each People item has the following fields: name, email, occupation, etc.
And we want to allow a People item to have a list of names (nicknames & such) where no other data is associated with the name - a name is just a string.
Is this exactly what PickleType is for? What kind of performance difference is there between using PickleType and creating a Name table, so that the name field of People becomes a one-to-many relationship?
Yes, this is one good use case for sqlalchemy's PickleType field, which is documented very well in the SQLAlchemy docs. There are obvious performance advantages to using it.
Using your example, assume you have a People item which uses a one-to-many lookup. This requires the database to perform a JOIN to collect the sub-elements; in this case, the Person's nicknames, if any. However, you have the benefit of having native objects ready to use in your Python code, without the cost of deserializing pickles.
In comparison, the list of strings can be pickled and stored as a PickleType in the database, which is internally stored as a LargeBinary. Querying for a Person will only require the database to hit a single table, with no JOINs, which results in an extremely fast return of data. However, you now incur the "cost" of de-pickling each item back into a Python object, which can be significant if you're not storing native datatypes, e.g. string, int, list, dict.
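A minimal sketch of the two layouts, with invented model names, in modern SQLAlchemy (1.4+) declarative style:

from sqlalchemy import Column, DateTime, ForeignKey, Integer, PickleType, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Person(Base):
    __tablename__ = "people"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    created_at = Column(DateTime)
    # Option 1: the whole list of nickname strings pickled into a single
    # LargeBinary column; it loads with the row, no JOIN needed.
    nicknames = Column(PickleType, default=list)

class Nickname(Base):
    # Option 2: the classic one-to-many table the question describes.
    __tablename__ = "nicknames"
    id = Column(Integer, primary_key=True)
    person_id = Column(Integer, ForeignKey("people.id"))
    nickname = Column(String)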
Additionally, by storing pickles in the database, you also lose the ability for the underlying database to filter results given a WHERE condition; especially with integers and datetime objects. A native database call can return values within a given numeric or date range, but will have no concept of what the string representing these items really is.
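For example, with the sketch above, the database can evaluate a date-range predicate on a native column, but nothing in SQL can look inside the pickled blob:

from datetime import datetime
from sqlalchemy import select

# Fine: created_at is a native column the database understands.
stmt = select(Person).where(Person.created_at >= datetime(2024, 1, 1))

# Not possible: no WHERE clause can match one nickname inside the
# pickled Person.nicknames blob; you'd have to load and unpickle it.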
Lastly, a simple change to a single pickle could allow arbitrary code execution within your application. It's unlikely, but must be stated.
IMHO, storing pickles is a nice way to store certain types of data, but the fit varies greatly with the type of data. I can tell you we use it pretty extensively in our schema, even on several tables with over half a billion records, quite nicely.
Background
I am looking for a way to dump the results of MySQL queries made with Python & Peewee to an excel file, including database column headers. I'd like the exported content to be laid out in a near-identical order to the columns in the database. Furthermore, I'd like a way for this to work across multiple similar databases that may have slightly differing fields. To clarify, one database may have a user table containing "User, PasswordHash, DOB, [...]", while another has "User, PasswordHash, Name, DOB, [...]".
The Problem
My primary problem is getting the column headers out in an ordered fashion. All attempts thus far have resulted in unordered results, and all of them are less than elegant.
Second, my methodology thus far has resulted in code which I'd (personally) hate to maintain, which I know is a bad sign.
Work so far
At present, I have used Peewee's pwiz.py script to generate the models for each of the preexisting tables in the target databases, then went and entered all primary and foreign keys. The relations are set up, and some brief tests showed they're associating properly.
Code: I've managed to get the column headers out using something similar to:
for i, column in enumerate(User._meta.get_field_names()):
    ws.cell(row=0, column=i).value = column
As mentioned, this is unordered. Also, doing it this way forces me to do something along the lines of
getattr(some_object, title)
to dynamically populate the fields accordingly.
Thoughts and Possible Solutions
Manually write out the order that I want things in an array, and use that to loop through and populate the data. The pro of this is very strict/granular control; the con is that I'd need to specify this for every database.
Create (whether manually or via a method) a hash of fields with an associated weight for every field I might encounter, then write a method for sorting _meta.get_field_names() by weight. The con of this is that the columns may not end up 100% in the right order, such as Name coming before DOB in one DB but after it in another.
Feel free to tell me I'm doing it all wrong or to suggest completely different ways of doing this; I'm all ears. I'm very much new to Python and Peewee (ORMs in general, actually). I could switch back to Perl and do the database querying via DBI with little to no hassle. However, its libraries for Excel would cause me just as many problems, and I'd like to take this as a chance to expand my knowledge.
There is a method on the model meta you can use:
for field in User._meta.get_sorted_fields():
    print(field.name)
This will print the field names in the order they are declared on the model.
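Putting it together, a hedged sketch of the export, assuming openpyxl and the User model from the question; get_sorted_fields() is the call from above:

from openpyxl import Workbook

wb = Workbook()
ws = wb.active

fields = User._meta.get_sorted_fields()

# Header row (modern openpyxl rows and columns are 1-indexed).
for col, field in enumerate(fields, start=1):
    ws.cell(row=1, column=col, value=field.name)

# One spreadsheet row per model instance, columns in declaration order.
for row, user in enumerate(User.select(), start=2):
    for col, field in enumerate(fields, start=1):
        ws.cell(row=row, column=col, value=getattr(user, field.name))

wb.save("users.xlsx")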
I've got a query that returns a fair number of rows, and have found that:

- we wind up throwing away most of the associated ORM instances; and
- building up those soon-to-be-thrown-away instances is pretty slow.
So I'd like to build only the instances that I need!
Unfortunately, I can't do this by simply restricting the query; I need to do a fair bit of "business logic" processing on each row before I can tell if I'll throw it out; I can't do this in SQL.
So I was thinking that I could use a MapperExtension to handle this: I'd subclass MapperExtension, and then override create_instance; that method would examine the row data, and either return EXT_CONTINUE if the data is worth building into an instance, or ... something else (I haven't yet decided what) otherwise.
Firstly, does this approach even make sense?
Secondly, if it does make sense, I haven't figured out how to find the data I need in the arguments that get passed to create_instance. I suspect it's in there somewhere, but it's hard to find ... instead of getting a row that directly corresponds to the particular class I'm interested in, I'm getting a row that corresponds to the query that SQLAlchemy generated, which happens to be a somewhat complex join between (say) tables A, B, and C.
The problem is that I don't know which elements of the row correspond to the fields in my ORM class: I want to be able to pluck out (e.g.) A.id, B.weight, and C.height.
I assume that somewhere inside the mapper, selectcontext, or class_ arguments is some sort of mapping between columns of my table, and offsets into the row. But I haven't yet found just the right thing. I've come tantalizingly close, though. For example, I've found that selectcontext.statement.columns contains the names of the generated columns ... but not those of the table I'm interested in. For example:
Column(u'A_id', UUID(), ...
...
Column(u'%(32285328 B)s_weight', MSInt(), ...
...
Column(u'%(32285999 C)s_height', MSInt(), ...
So: how do I map column names like C.height to offsets into the row?
The row accepts Column objects as indexes:
row[MyClass.some_element.__clause_element__()]
but that will only get you as far as the classes and aliased() constructs you have access to on the outside. It's very likely that would be all you'd need for that part of the issue (even though ultimately the idea won't work; read on).
If your statement has had subqueries wrapped around it, from using things like from_self() or join() to a polymorphic target, the create_instance() method doesn't give you access to the translation functions you'd need to accomplish that.
If you're trying to get at rows that are linked to an eagerload(), that's totally not something you should be doing. eagerload() is about optimizing the load of collections. If you want your query to join between two tables and you're looking to filter on the joined table, use join().
But above all, create_instance() is from version 0.1 of SQLAlchemy and I doubt anyone uses it for anything, and it has no capability to say, "skip this row". It has to return something or the mapper will create the instance on its own. So no matter how well you can interpret the row, there's no hook for what you want to do here.
If I really wanted to do such a thing, it would likely be easier to monkeypatch the fetchall() method of the returned ResultProxy to filter rows, and send it to Query.instances(). Any result can be sent to this method. Although, if the Query has done translations and such on the mapped selectables, it would need the original QueryContext as well to know how to translate. But this is nothing I'd be bothering with either.
Overall, if speed is so critical an issue throughout all of this that creating the object makes that big of a difference, I'd arrange things so I don't need the mapped objects at all for the whole operation, or I'd use caching, or generate the objects I need manually from a result set. I'd also make sure that I have access to all the targeted columns in the selectable I'm using so I can re-fetch from result rows, which means I either don't use automatic subquery/alias generation functions in the ORM, or I use the expression language directly. (If you're really hungry for speed and are in the mood to write large tracts of optimizing code, you should probably just be using the expression language.)
So the real questions you have to ask here are:

1. Have you verified that the real difference in speed is creating the object from the row, i.e. not fetching the row, or fetching its columns, etc.?
2. Does the row just have some expensive columns that you don't need? Have you looked into deferred()?
3. What are these business rules, and why can't they be done in SQL, as stored procedures, etc.?
4. How many thousands of rows are you really skipping here, that it's so "slow" not to "skip" them?
5. Have you investigated techniques for having the objects already present, like in-memory caches, preloads, etc.? For many scenarios, this fits the bill.
6. None of this works, and you really want to hack up some home-rolled optimization code? Then why not use the SQL expression language directly (see the sketch below)? If ultimately you're just dealing with a view layer, result rows are quite friendly (they allow "attribute"-style access and such), or build some quick "generate an object" routine from them. The ORM presents a very specific use case of the SQL expression language, and if you really need something much more lightweight than it, you're better off skipping it.
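To illustrate that last point, a self-contained sketch of the "use the expression language directly" route, written in modern SQLAlchemy 1.4+ style rather than the 0.x API discussed above; the tables and the business-rule check are made up:

from sqlalchemy import Column, Integer, MetaData, Table, create_engine, select

metadata = MetaData()
a = Table("a", metadata,
          Column("id", Integer, primary_key=True),
          Column("b_id", Integer))
b = Table("b", metadata,
          Column("id", Integer, primary_key=True),
          Column("weight", Integer))

engine = create_engine("sqlite://")
metadata.create_all(engine)

# Select only the columns the business logic needs; no ORM instances
# are ever constructed.
stmt = select(a.c.id, b.c.weight).join_from(a, b, a.c.b_id == b.c.id)

with engine.connect() as conn:
    for row in conn.execute(stmt):
        # Stand-in for the expensive per-row business-rule check.
        if row.weight is not None and row.weight > 10:
            print(row.id, row.weight)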