tag search on postgres via sqlalchemy - python

I'm trying to query a tags column (currently modelled as a character varying array). I would like to find any rows in which the tags column contains the query string as a left-anchored substring, and would like to do so using SQLAlchemy. My research has led me to learn about different ways of optimizing text search, but several lookup methods still require the use of unnest. I am open to changing the column from character varying array to something else (or having a separate, related table for tags), but am also curious about using unnest in SQLAlchemy.
e.g.
SELECT * FROM batches, UNNEST(tags) t WHERE t like 'poe%';
works and will find a row where tags column is ['math', 'poetry'].
I haven't found the right way to use unnest in SQLAlchemy's Python ORM. Any help appreciated.
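For reference, here is a minimal sketch of one way this can be expressed in SQLAlchemy 1.4+. The Batch model, its tags ARRAY(String) column, and the session name are assumptions based on the description above, not code from the question:

from sqlalchemy import select, func

# unnest(batches.tags) becomes a FROM-clause entry whose single column can be
# filtered on, mirroring the raw SQL above; PostgreSQL treats the set-returning
# function as an implicit LATERAL, so it may reference batches.tags.
tag = func.unnest(Batch.tags).column_valued("tag")

stmt = select(Batch).where(tag.like("poe%"))
matching_batches = session.execute(stmt).scalars().all()

On SQLAlchemy 1.4.33+ you can also pass joins_implicitly=True to column_valued() to quiet the cartesian-product linter for this kind of correlated function.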

Related

How can I safely parameterize table/column names in BigQuery SQL?

I am using Python's BigQuery client to create and keep up-to-date some tables in BigQuery that contain daily counts of certain Firebase events joined with data from other sources (sometimes grouped by country etc.). Keeping them up-to-date requires the deletion and replacement of data for past days, because the day tables for Firebase events can be changed after they are created (see here and here). I keep them up-to-date in this way to avoid querying the entire dataset, which is very financially/computationally expensive.
This deletion and replacement process needs to be repeated for many tables, so I need to reuse some queries stored in text files. For example, one deletes everything in the table from a particular date onward (delete from x where event_date >= y). But because BigQuery disallows the parameterization of table names (see here), I have to duplicate these query text files for each table I need to do this for. If I want to run tests I would also have to duplicate the aforementioned queries for test tables too.
I basically need something like psycopg2.sql for BigQuery so that I can safely parameterize table and column names whilst avoiding SQLi. I actually tried to repurpose this module by calling the as_string() method and using the result to query BigQuery, but the resulting syntax doesn't match, and I need to start a postgres connection to do it (as_string() expects a cursor/connection object). I also tried something similar with sqlalchemy.text to no avail. So I concluded I'd have to implement some way of parameterizing the table name myself, or implement some workaround using the Python client library. Any ideas on how I should go about doing this in a safe way that won't lead to SQLi? I cannot go into detail, but unfortunately I cannot store the tables in postgres or any other db.
As discussed in the comments, the best option for avoiding SQLi in your case is ensuring your server's security.
If you still need or want to validate your input before building the query, I recommend using a regex to check the input strings.
In Python you can use the re library.
Since I don't know how your code works, how your datasets/tables are organized, or exactly how you plan to check whether a string is a valid source, the basic example below shows how you could check a string using this library:
import re

tests = ["your-dataset.your-table", "(SELECT * FROM <another table>)", "dataset-09123.my-table-21112019"]

# Supposing that the input pattern is <dataset>.<table>
regex = re.compile(r"[a-zA-Z0-9-]+\.[a-zA-Z0-9-]+")

for t in tests:
    if regex.fullmatch(t):
        print("This source is ok")
    else:
        print("This source is not ok")
In this example, only strings that match the pattern dataset.table (where both the dataset and the table must contain only alphanumeric characters and dashes) will be considered valid.
When running the code, the first and the third elements of the list will be considered valid, while the second (which could potentially change your whole query) will be considered invalid.
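To connect this back to the original problem, here is a hedged sketch of how a table name that passed the check could then be spliced into the delete query with the official client. The table name, cutoff date and the regex variable are assumptions carried over from the snippets above, not details from the question:

import datetime
from google.cloud import bigquery

client = bigquery.Client()
table = "dataset-09123.my-table-21112019"

if regex.fullmatch(table):
    # The table name is formatted into the SQL string only after validation;
    # the date is passed as a real query parameter.
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("cutoff", "DATE", datetime.date(2019, 11, 1))
        ]
    )
    client.query(
        f"DELETE FROM `{table}` WHERE event_date >= @cutoff",
        job_config=job_config,
    ).result()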

Database schema - unable to link the tables in a way that suits the program

I have created a Python script to scrape documents for keywords. It runs on a server, and a cronjob makes sure that searches are performed multiple times per day (with different keywords at different hours).
To store the results I created the following table:
TABLE: 'SEARCHES'
search_date(string)
number_of_results(integer)
keywords_used(string)
^-- I created a single string from all keywords
All of this was easy to implement in Python/SQLite. However, now I want to track the number of results per individual keyword.
I have already created a 'keywords' table:
TABLE: 'KEYWORDS'
word(string)
total_hits(integer)
last_used(string)
However, I am having trouble coming up with a way of linking the two tables so that keywords can be tied to searches. Presumably the 'searches' table would have a foreign key linking to the keywords, but there can be as many as 10 keywords per search and only one foreign key column.
I looked into ManyToMany relations, but as I understand it this would create a large number of rows containing both 'search_id' and 'keyword_id', yet all I need is one row per search.
When the program is finished I want to create a GUI frontend and be able to list all searches that have been performed in a list/table, showing not just the keywords that were used but information like the search date as well. One line per search.
I also want to create a separate overview for the individual keywords, showing their effectiveness.
I'm just unable to come up with a database schema to accommodate this and could use some help to get my nose in the right direction.
I would suggest creating a "Matches" table that is a child of "Searches". Add a "Search ID" field to the Searches table to support the foreign key. The "Matches" table would hold the Search ID, each individual keyword, and perhaps the total hits for that keyword.
Then you can match "Matches" to "Keywords", and go from "Matches" to "Searches" using Search ID.
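A minimal sketch of that schema in SQLite, with illustrative table and column names that are not taken from the question:

import sqlite3

conn = sqlite3.connect("searches.db")
conn.executescript("""
CREATE TABLE searches (
    search_id         INTEGER PRIMARY KEY,
    search_date       TEXT,
    number_of_results INTEGER
);

-- one row per (search, keyword) pair
CREATE TABLE matches (
    search_id INTEGER REFERENCES searches(search_id),
    keyword   TEXT,
    hits      INTEGER
);
""")
conn.commit()

The one-line-per-search listing for the GUI can then be produced with a join and GROUP_CONCAT(keyword) grouped by search_id, so the extra rows in "matches" never have to be shown directly.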

SQLAlchemy - Performing AND operation, combining 'like' and 'in_'

I am trying to write an SQLAlchemy query.
I need to perform an AND operation, and I also need to know how to combine 'like' and 'in_' in SQLAlchemy.
I tried the query as follows:
query = query.filter(and_(model.tenant_id.in_(tenant_ids) , model.name.like(ips)))
What I need is a way to combine "in" and "like" in SQLAlchemy.
The following part is the one where I need to apply "like":
model.name.like(ips)
Consider that ips is a list; it contains the following data:
['1.6.7.1']
I need to apply % before and after each value. How can I do this?
I can't simply write it like this:
model.data.like(%ips%)
So could someone let me know the solution for this?
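For reference, a hedged sketch of one way to do this, assuming the intent is for model.name to match any entry in ips; or_() combines one LIKE clause per element:

from sqlalchemy import and_, or_

ips = ['1.6.7.1']

# build one LIKE per address, each wrapped in % ... %
like_clauses = [model.name.like('%' + ip + '%') for ip in ips]

query = query.filter(and_(model.tenant_id.in_(tenant_ids), or_(*like_clauses)))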

How to perform this insert into PostgreSQL using MySQL data?

I'm in the process of moving a MySQL database over to a PostgreSQL database. I have read all of the articles presented here, as well as some of the solutions presented on Stack Overflow, but the tools recommended don't seem to work for me. Both databases were generated by Django's syncdb, although the postgres db is more or less empty at the moment.
I tried to migrate the tables over using Django's built-in dumpdata/loaddata functions and its serializers, but it doesn't seem to like a lot of my tables, leading me to believe that writing a manual solution might be best in this case. I have code to verify that the column headers are the same for each table in the database and that the matching tables exist; that works fine.
I was thinking it would be best to just grab the MySQL data row by row and then insert it into the respective postgres table row by row (I'm not concerned with speed at the moment). The one thing is, I don't know the proper way to construct the insert statement. I have something like:
table_name = retrieve_table()
column_headers = get_headers(table_name)  # can return a tuple or a list
postgres_cursor = postgres_con.cursor()
rows = mysql_cursor.fetchall()

for row in rows:  # row is a tuple
    postgres_cursor.execute(????)
Where ???? would be the insert statement. I just don't know the proper way to construct it. I have the name of the table I would like to insert into as a string, I have the column headers that I can treat as a list, tuple, or string, and I have the respective values that I'd like to insert. What would be the recommended way to construct the statement? I have read psycopg2's documentation and didn't quite see anything that satisfies my needs. I don't know whether this is entirely the correct way to migrate, so if someone could steer me in the right direction or offer any advice I'd really appreciate it.
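For what it's worth, here is a minimal sketch of how such a statement could be built safely with psycopg2.sql, reusing the variable names from the snippet above and assuming column_headers is a list of column names:

from psycopg2 import sql

# Table and column names are quoted via sql.Identifier; the row values go
# through ordinary placeholders, so nothing is interpolated by hand.
insert_stmt = sql.SQL("INSERT INTO {table} ({cols}) VALUES ({vals})").format(
    table=sql.Identifier(table_name),
    cols=sql.SQL(", ").join(map(sql.Identifier, column_headers)),
    vals=sql.SQL(", ").join(sql.Placeholder() * len(column_headers)),
)

for row in rows:  # row is a tuple of values in column_headers order
    postgres_cursor.execute(insert_stmt, row)
postgres_con.commit()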

Searching across multiple tables (best practices)

I have a property management application consisting of the following tables:
tenants
landlords
units
properties
vendors-contacts
Basically I want one search field to search them all rather than having to select which category I am searching. Would this be an acceptable solution (technology-wise)?
Will searching across 5 tables be OK in the long run and not bog down the server? What's the best way of accomplishing this?
Using PostgreSQL
Why not create a view that is a union of the tables, aggregating the columns you want to search on into one column, and then search on that aggregated column?
You could do something like this:
select 'tenants:' || t.id::text, <shared fields> from tenants as t union
select 'landlords:' || l.id::text, <shared fields> from landlords as l union
...
This requires some logic on the client side: it has to know how to fabricate the key it's looking for in order to search on a single field.
That said, it's probably better if you just have a separate column which contains a "type" value (e.g. landlord, tenant) and then filter on both the type and the ID, as it will be computationally less expensive (and can be optimized better).
You want to use the built-in full text search or a separate product like Lucene. This is optimised for unstructured searches over heterogeneous data.
Also, don't forget that normal indices cannot be used for something LIKE '%...%'. A full text search engine will also be able to do efficient substring searches.
I would suggest using a specialized full-text indexing tool like Lucene for this. It will probably be easier to get up and running, and the result is faster and more featureful too. Postgres full text indexes will be useful if you also need structured search capability on top of this or transactionality of your search index is important.
If you do want to implement this in the database, something like the following scheme might work, assuming you use surrogate keys:
for each searchable table create a view that has the primary key column of that table, the name of the table and a concatenation of all the searchable fields in that table.
create a functional GIN or GiST index on the underlying table over the to_tsvector() of the exact same concatenation.
create a UNION ALL over all the views to create the searchable view.
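A hedged sketch of what that setup might look like for one of the tables, executed through psycopg2; the column names, the 'english' text search configuration and the connection string are illustrative assumptions:

import psycopg2

conn = psycopg2.connect("dbname=property_mgmt")  # connection details are assumed
cur = conn.cursor()

# one view per searchable table: primary key, table name, and the searchable
# text (kept as a tsvector here so it matches the ranking query below)
cur.execute("""
    CREATE VIEW tenants_search AS
    SELECT id, 'tenants' AS table_name,
           to_tsvector('english', coalesce(name, '') || ' ' || coalesce(email, '')) AS body
    FROM tenants
""")

# functional GIN index on the underlying table over the exact same expression
cur.execute("""
    CREATE INDEX tenants_fts_idx ON tenants
    USING gin (to_tsvector('english', coalesce(name, '') || ' ' || coalesce(email, '')))
""")

# repeat for the other tables, then combine everything into the searchable view
cur.execute("""
    CREATE VIEW search_view AS
    SELECT id, table_name, body FROM tenants_search
    UNION ALL
    SELECT id, table_name, body FROM landlords_search
""")
conn.commit()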
After that you can do the searches like this:
SELECT id, table_name, ts_rank_cd(body, query) AS rank
FROM search_view, to_tsquery('search&words') query
WHERE query @@ body
ORDER BY rank DESC
LIMIT 10;
You should be fine, and there's really no other good (easy) way to do this. Just make sure the fields you are searching on are properly indexed though.
