Problem Summary:
I'm using Python to send a series of queries to a database (one by one) from a loop until a non-empty result set is found. Each query has three conditions that must all be met, placed in a WHERE clause. Every iteration of the loop relaxes the conditions, moving from specific keywords to more generic ones.
Details:
Assuming the conditions are keywords based on a pre-made list ordered by accuracy such as:
Option  KEYWORD1  KEYWORD2  KEYWORD3
1       exact     exact     exact      # most accurate!
2       generic   exact     exact      # accurate
3       generic   generic   exact      # close enough
4       generic   generic   generic    # close
5       generic+  generic   generic    # almost there
... and so on.
On the database side, I have a description column that should contain all three keywords, either in their specific or generic form. When I run the loop in Python, this is what actually happens:
-- The first sql statement will be like
Select *
From MyTable
Where Description LIKE 'keyword1-exact%'
AND Description LIKE 'keyword2-exact%'
AND Description LIKE 'keyword3-exact%'
-- if no results, the second sql statement will be like
Select *
From MyTable
Where Description LIKE 'keyword1-generic%'
AND Description LIKE 'keyword2-exact%'
AND Description LIKE 'keyword3-exact%'
-- if no results, the third sql statement will be like
Select *
From MyTable
Where Description LIKE 'keyword1-generic%'
AND Description LIKE 'keyword2-generic%'
AND Description LIKE 'keyword3-exact%'
-- and so on until a non-empty result set is found or all keywords were used
I'm using the approach above to get the most accurate results with the minimum number of irrelevant ones (the more generic the keywords, the more irrelevant results show up, and those need additional processing).
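For reference, the loop is roughly the following (a minimal sketch, not my actual code; keyword_sets and cursor are illustrative names, and I'm assuming pyodbc-style ? placeholders):

# Accuracy-ordered keyword combinations, most specific first (illustrative).
keyword_sets = [
    ("keyword1-exact", "keyword2-exact", "keyword3-exact"),
    ("keyword1-generic", "keyword2-exact", "keyword3-exact"),
    ("keyword1-generic", "keyword2-generic", "keyword3-exact"),
    # ... down to the most generic combination
]

SQL = ("SELECT * FROM MyTable "
       "WHERE Description LIKE ? AND Description LIKE ? AND Description LIKE ?")

rows = []
for kw1, kw2, kw3 in keyword_sets:
    cursor.execute(SQL, (kw1 + "%", kw2 + "%", kw3 + "%"))
    rows = cursor.fetchall()
    if rows:  # stop at the first non-empty result set
        break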
Question:
My approach above does exactly what I want, but I'm sure it's not efficient.
What would be the proper way to do this operation in a single query instead of a Python loop (knowing that I only have read access to the database, so I can't use stored procedures)?
Here is an idea:
select top 1 *
from (
    select
        MyTable.*,
        accuracy = case when description like keyword1 + '%'
                         and description like keyword2 + '%'
                         and description like keyword3 + '%'
                        then accuracy
                   end
    -- an example of data from MyTable
    from (select description = 'exact') MyTable
    cross join (values
        -- generate the full list like this in Python
        -- or read it from a table if it is in the database
        (1, 'exact', 'exact', 'exact'),
        (2, 'generic', 'exact', 'exact'),
        (3, 'generic', 'generic', 'exact')
    ) t(accuracy, keyword1, keyword2, keyword3)
) t
where accuracy is not null
order by accuracy
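The VALUES list in the middle can be generated in Python before sending the query, for example like this (a sketch; interpolating into the SQL string is only acceptable here because the keywords come from your own pre-made list, not from user input):

# Build the VALUES rows from an accuracy-ordered list of keyword tuples
# (illustrative names; keywords must come from a trusted list).
keyword_sets = [
    ("exact", "exact", "exact"),
    ("generic", "exact", "exact"),
    ("generic", "generic", "exact"),
]

values_rows = ",\n".join(
    "(%d, '%s', '%s', '%s')" % (i, k1, k2, k3)
    for i, (k1, k2, k3) in enumerate(keyword_sets, start=1)
)
print(values_rows)
# (1, 'exact', 'exact', 'exact'),
# (2, 'generic', 'exact', 'exact'),
# (3, 'generic', 'generic', 'exact')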
I would not do a loop over database queries. Instead I would search for the least specific, i.e. most generic, keyword and return all these rows.
Select *
From MyTable
Where Description LIKE '%iPhone%'
This returns all the rows with iPhones. Now do the further processing, i.e. find the best match, in memory. This is much faster than multiple queries.
If you have several equally generic keywords, query them with OR:
Select *
From MyTable
Where Description LIKE '%iPhone%' OR
Description LIKE '%i-Phone%'
But in any case, make only one query.
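The in-memory post-processing could look something like this (a sketch; rows is the single fetched result set and keyword_sets is the accuracy-ordered list from the question, both illustrative names, and row.Description depends on your driver):

# Rank each fetched row by the most accurate keyword set it matches
# (1 = most accurate) and keep the best-ranked row.
def accuracy(description, keyword_sets):
    for rank, kws in enumerate(keyword_sets, start=1):
        if all(kw in description for kw in kws):
            return rank
    return None  # row matches no keyword set

best_rank, best_row = None, None
for row in rows:
    rank = accuracy(row.Description, keyword_sets)
    if rank is not None and (best_rank is None or rank < best_rank):
        best_rank, best_row = rank, row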
Please try using the regular expression (RegEx) functionality of SQL Server.
Or else you can try importing re in Python for regular expressions.
First collect the data, then use re to achieve your goal.
Hope this is helpful.
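For example, a minimal sketch of the Python side (rows is assumed to be the already-fetched result set; the pattern and names are illustrative):

import re

# Hypothetical pattern: all three keywords must occur somewhere in the
# description, in any order, enforced with lookaheads.
pattern = re.compile(r"(?=.*keyword1)(?=.*keyword2)(?=.*keyword3)",
                     re.IGNORECASE | re.DOTALL)

matches = [row for row in rows if pattern.search(row.Description)]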
Related
I have this query:
SELECT COUNT(DISTINCT Serial, DatumOrig, Glucose) FROM values;
I've tried to recreate it with SQLAlchemy this way:
session.query(Value.Serial, Value.DatumOrig, Value.Glucose).distinct().count()
But this translates to this:
SELECT count(*) AS count_1
FROM (SELECT DISTINCT
        values.`Serial` AS `values_Serial`,
        values.`DatumOrig` AS `values_DatumOrig`,
        values.`Glucose` AS `values_Glucose`
      FROM values) AS anon_1
Which does not call the count function inline but wraps the select distinct into a subquery.
My question is: What are the different ways with SQLAlchemy to count a distinct select on multiple columns and what are they translating into?
Is there any solution which would translate into my original query? Is there any serious difference in performance or memory usage?
First off, I think that COUNT(DISTINCT) supporting more than one expression is a MySQL extension. You can achieve something similar in, for example, PostgreSQL with ROW values, but the behaviour regarding NULL is not the same. In MySQL, if any of the value expressions evaluates to NULL, the row does not qualify. That also explains the difference between the two queries in the question:
If any of Serial, DatumOrig, or Glucose is NULL in the COUNT(DISTINCT) query, that row does not qualify or in other words does not count.
COUNT(*) is the cardinality of the subquery anon_1, or in other words the count of rows. SELECT DISTINCT Serial, DatumOrig, Glucose will include (distinct) rows with NULL.
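To illustrate with hypothetical data:

# Hypothetical rows, columns (Serial, DatumOrig, Glucose):
rows = [
    (1, "2019-01-01", 100),
    (1, "2019-01-01", 100),  # exact duplicate: counted once by both queries
    (2, None, 120),          # NULL DatumOrig
]
# COUNT(DISTINCT Serial, DatumOrig, Glucose)      -> 1 (the NULL row is skipped)
# COUNT(*) over SELECT DISTINCT of those columns  -> 2 (the NULL row is kept)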
Looking at EXPLAIN output for the 2 queries it looks like the subquery causes MySQL to use a temporary table. That will likely cause a performance difference, especially if it is materialized on disk.
Producing the multi valued COUNT(DISTINCT) query in SQLAlchemy is a bit tricky, because count() is a generic function and implemented closer to the SQL standard. It only accepts a single expression as its (optional) positional argument and the same goes for distinct(). If all else fails, you can always revert to text() fragments, like in this case:
# NOTE: text() fragments are included in the query as is, so if the text originates
# from an untrusted source, the query cannot be trusted.
session.query(func.count(distinct(text("`Serial`, `DatumOrig`, `Glucose`")))).\
    select_from(Value).\
    scalar()
which is far from readable and maintainable code, but gets the job done now. Another option is to write a custom construct that implements the MySQL extension, or rewrite the query as you have attempted. One way to form a custom construct that produces the required SQL would be:
from itertools import count
from sqlalchemy import func, distinct as _distinct

def _comma_list(exprs):
    # NOTE: Magic number alert, the precedence value must be large enough to avoid
    # producing parentheses around the "comma list" when passed to distinct()
    ps = count(10 + len(exprs), -1)
    exprs = iter(exprs)
    cl = next(exprs)
    for p, e in zip(ps, exprs):
        cl = cl.op(',', precedence=p)(e)
    return cl

def distinct(*exprs):
    return _distinct(_comma_list(exprs))

session.query(func.count(distinct(
    Value.Serial, Value.DatumOrig, Value.Glucose))).scalar()
I am trying to get a query into a variable called results, in which I query the database to find the books with a title like the input from the search bar received from a post method. The query I am running is as follows:
results = db.execute("SELECT * FROM books WHERE title LIKE (%:search%)", {"search": search}).fetchall();
With the above query, I get the following error:
sqlalchemy.exc.ProgrammingError: (psycopg2.ProgrammingError) syntax error at or near "%".
This works if I remove the % signs, or if I hard-code the LIKE parameter (e.g. LIKE ('%the%')); but without the % the query only returns results when the search exactly matches a book title, and hard-coding the pattern defeats the purpose of variable substitution. I am also wondering if it's possible to use ILIKE for case-insensitive matching when querying with SQLAlchemy.
I am aware that I could use Object Relational Mapping, and use different functions such as the filter function and whatnot, but for this assignment we are meant to not use ORM and use simple queries. Any suggestions?
Pass the entire search string as the parameter to the LIKE operator:
results = db.execute(text("SELECT * FROM books WHERE title LIKE :search"),
                     {"search": f"%{search}%"}).fetchall()
or alternatively concatenate in the database:
results = db.execute(
    text("SELECT * FROM books WHERE title LIKE ('%' || :search || '%')"),
    {"search": search}).fetchall()
I'm trying to create a method where I can pass a parameter (a number) and get that number of rows as output. See below:
def get_data(i):
    for i in range(0, i):
        TNG = "SELECT DISTINCT hub, value, date_inserted FROM ZE_DATA.AESO_CSD_SUMMARY where opr_date >= trunc(sysdate) order by date_inserted desc fetch first i rows only"
Here i is a number. In the "fetch first i rows only" part of the query, I want it to fetch i rows.
Thoughts on the syntax?
Seems like you're looking for a limit argument. You didn't mention which SQL dialect you're using, but most dialects have an equivalent clause (e.g. LIMIT, TOP, or FETCH FIRST).
I'm also a little confused by the structure of that function; it seems like you want to run the query once and iterate through the result set, rather than building the query string i times.
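Since trunc(sysdate) and fetch first suggest Oracle, here is a minimal sketch assuming Oracle 12c+ and a cx_Oracle-style cursor named cur (both assumptions, not stated in the question); the row count is passed as a bind variable instead of being pasted into the string:

# Bind the row count rather than interpolating it into the SQL;
# Oracle 12c+ accepts a bind variable in FETCH FIRST ... ROWS ONLY.
SQL = """
    SELECT DISTINCT hub, value, date_inserted
    FROM ZE_DATA.AESO_CSD_SUMMARY
    WHERE opr_date >= trunc(sysdate)
    ORDER BY date_inserted DESC
    FETCH FIRST :n ROWS ONLY
"""

def get_data(i):
    cur.execute(SQL, n=i)  # one query, then iterate over the rows
    return cur.fetchall()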
I've got a weekly process which does a full replace operation on a few tables. The process is weekly since there are large amounts of data as a whole. However, we also want to do daily/hourly delta updates, so the system would be more in sync with production.
When we update data, we are creating duplications of rows (updates of an existing row), which I want to get rid of. To achieve this, I've written a python script which runs the following query on a table, inserting the results back into it:
QUERY = """#standardSQL
select {fields}
from (
select *
, max(record_insert_time) over (partition by id) as max_record_insert_time
from {client_name}_{environment}.{table} as a
)
where 1=1
and record_insert_time = max_record_insert_time"""
The {fields} variable is replaced with a list of all the table columns; I can't use * here because that would only work for 1 run (the next will already have a field called max_record_insert_time and that would cause an ambiguity issue).
Everything is working as expected, with one exception - some of the columns in the table are of RECORD datatype; despite not using aliases for them, and selecting their fully qualified name (e.g. record_name.child_name), when the output is written back into the table, the results are flattened. I've added the flattenResults: False config to my code, but this has not changed the outcome.
I would love to hear thoughts about how to resolve this issue using my existing plan, other methods of deduping, or other methods of handling delta updates altogether.
Perhaps you can use this in the outer statement:
SELECT * EXCEPT (max_record_insert_time)
This should keep the exact record structure (for more detailed documentation see https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#select-except).
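Applied to the query from the question, the whole statement could look roughly like this (a sketch keeping the original Python placeholders), which also removes the need to enumerate {fields}:

QUERY = """#standardSQL
select * except (max_record_insert_time)
from (
  select *
    , max(record_insert_time) over (partition by id) as max_record_insert_time
  from {client_name}_{environment}.{table} as a
)
where record_insert_time = max_record_insert_time"""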
An alternative approach would be to include in {fields} only top-level columns, even if they are not leaves, i.e. just record_name and not record_name.*
The answer below is definitely not better than the straightforward SELECT * EXCEPT modifier, but I wanted to present an alternative version.
SELECT t.*
FROM (
  SELECT
    id,
    MAX(record_insert_time) AS max_record_insert_time,
    ARRAY_AGG(t) AS all_records_for_id
  FROM yourTable AS t
  GROUP BY id
), UNNEST(all_records_for_id) AS t
WHERE t.record_insert_time = max_record_insert_time
ORDER BY id
What the above query does is: first, it groups all records for each id into an array of the respective rows, along with the max value of insert_time; then, for each id, it flattens the (previously aggregated) rows and picks only those whose insert_time matches the max time. The result is as expected. No analytic function is involved, just simple aggregation, at the cost of an extra UNNEST ...
Still - at least a different option :o)