To describe my problem, consider this raw SQL:
select datediff(now(), create_time) > 7 as is_new from test order by is_new desc limit 19;
I tried to implement it with SQLAlchemy step by step:
diff_days = func.datediff(today, test.create_time).label("diff_days")
session.query(diff_days).filter(test.id.in_((1,2,3,33344))).order_by(diff_days.asc()).all()
This works fine. But when I want to express the > comparison for MySQL, it fails:
is_new = func.greater(func.datediff(today, test.create_time), 7).label("is_new")
session.query(is_new).filter(test.id.in_((1,2,3,33344))).order_by(is_new.asc()).all()
I understand that SQLAlchemy renders my expression as a greater() function, which MySQL does not support. So how can I get my a > b comparison with something like greater(a, b)?
Maybe the simpler SQL select a > b from test describes the problem too, while the above is my original need. So the question can be restated as:
How to implement select a > b from test using the SQLAlchemy ORM.
SQLAlchemy offers you rich operator overloading, so just do
is_new = (func.datediff(today, test.create_time) > 7).label("is_new")
session.query(is_new).\
    filter(test.id.in_([1, 2, 3, 33344])).\
    order_by(is_new.asc()).\
    all()
This works since the created Function is also a ColumnElement and as such has ColumnOperators.
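If you want to confirm what this renders to, you can compile the statement against the MySQL dialect; a quick check (the exact output may vary by SQLAlchemy version):
from sqlalchemy.dialects import mysql

# Compile the query against the MySQL dialect to see the SQL it will emit.
print(session.query(is_new).statement.compile(dialect=mysql.dialect()))
# Roughly: SELECT datediff(%s, test.create_time) > %s AS is_new FROM test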
I am trying to add a very basic SQL query to my Flask API project. I am using SQLAlchemy as the database manipulation tool.
The query I want to run is the following:
SELECT * from trip_metadata where trip_id in ('trip_id_1', 'trip_id_2', ..., 'trip_id_n')
So, in my code, I wrote:
trips_ids = ['trip_id_1', 'trip_id_2', ..., 'trip_id_n']
result = session.query(dal.trip_table).filter(dal.trip_table.columns.trip_id.in_(trips_ids)).all()
When n is low, let's say n=10, it works very well and I get the expected result. However, when n is high, let's say n > 1000, it crashes. I am very surprised, as it seems common to put many values in the filter.
The error log is:
sqlalchemy.exc.DBAPIError: (pyodbc.Error) ('07002', '[07002] [Microsoft][ODBC Driver 17 for SQL Server]COUNT field incorrect or syntax error (0) (SQLExecDirectW)')
[SQL: SELECT * FROM trip_metadata
WHERE trip_metadata.trip_id IN (?, ?, ..., ?)]
[parameters: ('ABC12345-XXXX-XXXX-XXXX-000000000000', 'DEF12345-XXXX-XXXX-XXXX-000000000000', ..., 'GHI12345-XXXX-XXXX-XXXX-000000000000')]
(Background on this error at: https://sqlalche.me/e/14/dbapi)
127.0.0.1 - - [05/Jan/2023 10:35:48] "POST /api/v1/tripsAggregates HTTP/1.1" 500 -
However, when I write the raw request, it works well, even when n is very high:
from sqlalchemy import text
trip_ids_tuple = ('trip_id_1', 'trip_id_2', ..., 'trip_id_n')
result = session.execute(text(f"SELECT * FROM trip_metadata where trip_id in {trip_ids_tuple}"))
But I don't think this is a good way of doing it, because I have much more complex requests to write, and using sqlalchemy filters is better suited to that.
Do you have any idea how to fix my issue while keeping the sqlalchemy library? Thank you very much.
Microsoft's ODBC drivers for SQL Server execute statements using a system stored procedure on the server (sp_prepexec or sp_prepare). Stored procedures on SQL Server are limited to 2100 parameter values, so with a model like
class Trip(Base):
    __tablename__ = "trip"
    id = Column(String(32), primary_key=True)
this code will work
with Session(engine) as session:
    trips_ids = ["trip_id_1", "trip_id_2"]
    q = session.query(Trip).where(Trip.id.in_(trips_ids))
    results = q.all()
"""SQL emitted:
SELECT trip.id AS trip_id
FROM trip
WHERE trip.id IN (?, ?)
[generated in 0.00092s] ('trip_id_1', 'trip_id_2')
"""
because it only has two parameter values. If the length of the trips_ids list is increased to thousands of values the code will eventually fail.
One way to avoid the issue is to have SQLAlchemy construct an IN clause with literal values instead of parameter placeholders:
from sqlalchemy import bindparam

q = session.query(Trip).where(
    Trip.id.in_(bindparam("p1", expanding=True, literal_execute=True))
)
results = q.params(p1=trips_ids).all()
"""SQL emitted:
SELECT trip.id AS trip_id
FROM trip
WHERE trip.id IN ('trip_id_1', 'trip_id_2')
[generated in 0.00135s] ()
"""
From the error, it could indicate a formatting issue (are the string characters escaped properly?). This can happen even when N is small, but data that breaks the formatting then has a low chance of occurring; when N gets large, there is more likelihood of "bad data" among the values sqlalchemy tries to put into the query. I can't say for certain here; it might be a memory or operating system issue too.
First thing to ask here: do you need to have the tuples provided externally to the query? Is there a way to query for the trip_ids via a join? It is usually best to push operations to the SQL engine, but this isn't always possible if you're getting the tuples/lists of ids elsewhere.
Rule out whether there's a data issue that errors out during the execute(). Look into escaping the string values. You can try chunking the list into smaller bits to narrow down the potentially problematic values (see the appendix below).
Try a different way to string-format the query. Note that the values must be quoted as SQL string literals (the plain ','.join(trip_ids_tuple) would render them as bare identifiers):
sql = f"SELECT * FROM trip_metadata where trip_id in ({','.join(repr(t) for t in trip_ids_tuple)})"
The resulting sql string value:
"SELECT * FROM trip_metadata where trip_id in ('trip_id_1','trip_id_2','trip_id_n')"
Appendix:
You can potentially build in a chunker mechanism to break the list into smaller pieces; the crashing might be an operating system or memory issue. For example, you can use list slicing:
def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

# Example usage
my_list = [1, 2, 3, 4, 5, 6, 7, 8]
chunk_size = 3
for chunk in chunker(my_list, chunk_size):
    print(chunk)
output
[1, 2, 3]
[4, 5, 6]
[7, 8]
Use a chunk size where N is manageable. This will also help narrow down potentially problematic str values that error out.
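Applied to the original ORM query, the chunker keeps every statement small; a sketch reusing trips_ids and dal.trip_table from the question:
result = []
for chunk in chunker(trips_ids, 1000):
    # Each chunk yields its own IN (...) clause with at most 1000 placeholders,
    # well under SQL Server's 2100-parameter limit.
    result.extend(
        session.query(dal.trip_table)
        .filter(dal.trip_table.columns.trip_id.in_(chunk))
        .all()
    )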
I wrote code in Python to manipulate a table I have in my database, using SQLAlchemy. Basically, I have Table 1 with 2,500,000 entries and Table 2 with 200,000 entries. What I am trying to do is compare the source ip and dest ip in Table 1 with the source ip and dest ip in Table 2. If there is a match, I replace the ip source and ip dest in Table 1 with the data that matches them in Table 2, and I add the entry to Table 3. My code also checks whether the entry is already in the new table; if so, it skips it and goes on to the next row.
My problem is it's extremely slow. I launched my script yesterday, and in 24 hours it only went through 47,000 entries out of 2,500,000. I am wondering if there is any way I can speed up the process. It's a Postgres db, and I can't tell whether the script taking this much time is reasonable or something is up. If anyone has had a similar experience with something like this, how much time did it take before completion?
Many thanks.
session = Session()
i = 0
start_id = 1
flows = session.query(Table1).filter(Table1.id >= start_id).all()
result_number = len(flows)
vlan_list = {"['0050']", "['0130']", "['0120']", "['0011']", "['0110']"}
while i < result_number:
    for flow in flows:
        if flow.vlan_destination in vlan_list:
            usage = session.query(Table2).filter(Table2.ip == str(flow.ip_destination)).all()
            if len(usage) > 0:
                usage = usage[0].usage
            else:
                usage = str(flow.ip_destination)
            usage_ip_src = session.query(Table2).filter(Table2.ip == str(flow.ip_source)).all()
            if len(usage_ip_src) > 0:
                usage_ip_src = usage_ip_src[0].usage
            else:
                usage_ip_src = str(flow.ip_source)
            if flow.protocol == "17":
                protocol = func.REPLACE(flow.protocol, "17", 'UDP')
            elif flow.protocol == "1":
                protocol = func.REPLACE(flow.protocol, "1", 'ICMP')
            elif flow.protocol == "6":
                protocol = func.REPLACE(flow.protocol, "6", 'TCP')
            else:
                protocol = flow.protocol
            is_in_db = session.query(Table3).filter(Table3.protocol == protocol)\
                .filter(Table3.application == flow.application)\
                .filter(Table3.destination_port == flow.destination_port)\
                .filter(Table3.vlan_destination == flow.vlan_destination)\
                .filter(Table3.usage_source == usage_ip_src)\
                .filter(Table3.state == flow.state)\
                .filter(Table3.usage_destination == usage).count()
            if is_in_db == 0:
                to_add = Table3(usage_ip_src, usage, protocol, flow.application, flow.destination_port,
                                flow.vlan_destination, flow.state)
                session.add(to_add)
                session.flush()
                session.commit()
                print("added " + str(i))
            else:
                print("usage already in DB")
            i = i + 1
session.close()
EDIT As requested, here are more details: Table 1 has 11 columns; the two we are interested in are source ip and dest ip.
Here is Table 2: it has an IP and a Usage. What my script does is take the source ip and dest ip from Table 1 and look for a match in Table 2. If one exists, it replaces the ip address with the usage and adds this, along with some of the columns of Table 1, to Table 3.
While doing this, when adding the protocol column to Table 3, it writes the protocol name instead of the number, just to make it more readable.
EDIT 2 I am trying to think about this differently, so I made a diagram of my problem (the X problem).
What I am trying to figure out is whether my code (the Y solution) is working as intended. I've been coding in Python for only a month, and I feel like I am messing something up. My code is supposed to take every row from Table 1, compare it to Table 2, and add data to Table 3. Table 1 has over 2 million entries, and it's understandable that it should take a while, but it's too slow. For example, when I had to load the data from the API into the db, it went faster than the comparisons I'm trying to do with everything that is already in the db. I am running my code on a virtual machine that has sufficient memory, so I am sure it's my code that is lacking, and I need direction as to what can be improved. Screenshots of my tables:
(Screenshots: Table 2, Table 3, Table 1)
EDIT 3: PostgreSQL query
SELECT
coalesce(table2_1.usage, table1.ip_source) AS coalesce_1,
coalesce(table2_2.usage, table1.ip_destination) AS coalesce_2,
CASE table1.protocol WHEN %(param_1)s THEN %(param_2)s WHEN %(param_3)s THEN %(param_4)s WHEN %(param_5)s THEN %(param_6)s ELSE table1.protocol END AS anon_1,
table1.application AS table1_application,
table1.destination_port AS table1_destination_port,
table1.vlan_destination AS table1_vlan_destination,
table1.state AS table1_state
FROM
table1
LEFT OUTER JOIN table2 AS table2_2 ON table2_2.ip = table1.ip_destination
LEFT OUTER JOIN table2 AS table2_1 ON table2_1.ip = table1.ip_source
WHERE
table1.vlan_destination IN (
%(vlan_destination_1)s,
%(vlan_destination_2)s,
%(vlan_destination_3)s,
%(vlan_destination_4)s,
%(vlan_destination_5)s
)
AND NOT (
EXISTS (
SELECT
1
FROM
table3
WHERE
table3.usage_source = coalesce(table2_1.usage, table1.ip_source)
AND table3.usage_destination = coalesce(table2_2.usage, table1.ip_destination)
AND table3.protocol = CASE table1.protocol WHEN %(param_1)s THEN %(param_2)s WHEN %(param_3)s THEN %(param_4)s WHEN %(param_5)s THEN %(param_6)s ELSE table1.protocol END
AND table3.application = table1.application
AND table3.destination_port = table1.destination_port
AND table3.vlan_destination = table1.vlan_destination
AND table3.state = table1.state
)
)
Given the current question, I think this at least comes close to what you might be after. The idea is to perform the entire operation in the database, instead of fetching everything – the whole 2,500,000 rows – and filtering in Python etc.:
from sqlalchemy import func, case, insert
from sqlalchemy.orm import aliased

def newhotness(session, vlan_list):
    # The query needs to join Table2 twice, so it has to be aliased
    dst = aliased(Table2)
    src = aliased(Table2)
    # Prepare required SQL expressions
    usage = func.coalesce(dst.usage, Table1.ip_destination)
    usage_ip_src = func.coalesce(src.usage, Table1.ip_source)
    protocol = case({"17": "UDP",
                     "1": "ICMP",
                     "6": "TCP"},
                    value=Table1.protocol,
                    else_=Table1.protocol)
    # Form a query producing the data to insert to Table3
    flows = session.query(
        usage_ip_src,
        usage,
        protocol,
        Table1.application,
        Table1.destination_port,
        Table1.vlan_destination,
        Table1.state).\
        outerjoin(dst, dst.ip == Table1.ip_destination).\
        outerjoin(src, src.ip == Table1.ip_source).\
        filter(Table1.vlan_destination.in_(vlan_list),
               ~session.query(Table3).
               filter_by(usage_source=usage_ip_src,
                         usage_destination=usage,
                         protocol=protocol,
                         application=Table1.application,
                         destination_port=Table1.destination_port,
                         vlan_destination=Table1.vlan_destination,
                         state=Table1.state).
               exists())
    stmt = insert(Table3).from_select(
        ["usage_source", "usage_destination", "protocol", "application",
         "destination_port", "vlan_destination", "state"],
        flows)
    return session.execute(stmt)
If the vlan_list is selective, in other words filters out most rows, this will perform far fewer operations in the database. Depending on the size of Table2 you may benefit from indexing Table2.ip, but do test first. If it is relatively small, I would guess that PostgreSQL will perform a hash or nested loop join there. If some column of the ones used to filter out duplicates in Table3 is unique, you could perform an INSERT ... ON CONFLICT ... DO NOTHING instead of removing duplicates in the SELECT using the NOT EXISTS subquery expression (which PostgreSQL will perform as an antijoin). If there is a possibility that the flows query may produce duplicates, add a call to Query.distinct() to it.
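For reference, the ON CONFLICT variant mentioned above could look roughly like this; a sketch only, assuming Table3 has a unique constraint covering the duplicate-detection columns, and using the PostgreSQL-specific insert construct in place of the NOT EXISTS filter:
from sqlalchemy.dialects.postgresql import insert as pg_insert

# Reuse the flows SELECT from above, but without the ~...exists() filter,
# and let the database silently skip conflicting rows instead.
stmt = pg_insert(Table3).from_select(
    ["usage_source", "usage_destination", "protocol", "application",
     "destination_port", "vlan_destination", "state"],
    flows).on_conflict_do_nothing()
session.execute(stmt)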
Hello, I am trying to translate the following relatively simple query to SQLAlchemy, but I get:
('Unexpected error:', <class 'sqlalchemy.exc.InvalidRequestError'>)
SELECT model, COUNT(model) AS count FROM log.logs
WHERE SOURCE = "WEB" AND year(timestamp) = 2015 AND month(timestamp) = 1
and account = "Test" and brand = "Nokia" GROUP BY model ORDER BY count DESC limit 10
This is what I wrote, but it is not working. What is wrong?
devices = db.session.query(Logs.model).filter_by(source=source).filter_by(account=acc).filter_by(brand=brand).\
filter_by(year=year).filter_by(month=month).group_by(Logs.model).order_by(Logs.model.count().desc()).all()
It's a bit hard to tell from your code sample, but the following is hopefully the correct SQLAlchemy code. Try:
from sqlalchemy.sql import func

devices = (db.session
           .query(Logs.model, func.count(Logs.model).label('count'))
           .filter_by(source=source)
           .filter_by(account=acc)
           .filter_by(brand=brand)
           .filter_by(year=year)
           .filter_by(month=month)
           .group_by(Logs.model)
           .order_by(func.count(Logs.model).desc()).all())
Note that I've enclosed the query in a (...) to avoid having to use \ at the end of each line.
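Note also that the raw SQL filters on year(timestamp) and month(timestamp) and ends with LIMIT 10. If Logs has a single timestamp column rather than separate year and month columns, a closer translation might be the following sketch (assuming MySQL's year()/month() functions; Logs.timestamp is an assumption):
devices = (db.session
           .query(Logs.model, func.count(Logs.model).label('count'))
           .filter_by(source=source, account=acc, brand=brand)
           # Apply the date functions server-side, as in the raw SQL.
           .filter(func.year(Logs.timestamp) == 2015,
                   func.month(Logs.timestamp) == 1)
           .group_by(Logs.model)
           .order_by(func.count(Logs.model).desc())
           .limit(10)
           .all())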
Here is my MSSQL code snippet:
SELECT
Sum((Case When me.status not in ('CLOSED','VOID') and me.pc_cd in ('IK','JM')
Then 1 else 0 end)) as current_cd
from
ccd_pvc me with(nolock)
How would I use the AND operator with a CASE statement if I write the above statement in SQLAlchemy?
I have tried this, but it did not work:
case([and_((ccd_pvc.status.in_(['CLOSED', 'VOID']),ccd_pvc.pc_cd.in_(['IK','JM'])),
literal_column("'greaterthan100'"))])
I have searched through the sqlalchemy documentation but did not find info on using logical operators with the case statement.
The link has some info on this.
This should get you started:
from sqlalchemy import and_, case, func
from sqlalchemy.orm import aliased

ccd_pvc = aliased(CcdPvc, name="me")
expr = func.sum(
    case([(and_(
        ccd_pvc.status.notin_(['CLOSED', 'VOID']),
        ccd_pvc.pc_cd.in_(['IK', 'JM'])
    ), 1)], else_=0)
).label("current_cd")

q = session.query(expr)
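Executing it then returns the single aggregate value; roughly (session and the CcdPvc model are assumed from context):
current_cd = q.scalar()
# Rendered approximately as:
# SELECT sum(CASE WHEN (me.status NOT IN ('CLOSED', 'VOID')
#                       AND me.pc_cd IN ('IK', 'JM')) THEN 1 ELSE 0 END) AS current_cd
# FROM ccd_pvc AS me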
I'm using the SQLAlchemy ORM to construct the MySQL queries in my application, and am perfectly able to add basic filters to the query, like so:
query = meta.Session.query(User).filter(User.user_id==1)
Which gives me something basically equivalent to this:
SELECT * FROM users WHERE user_id = 1
My question is how I would integrate some basic MySQL math functions into my query. So, say for instance I wanted to get users near a certain latitude and longitude. So I need to generate this SQL ($mylatitude and $mylongitude are the static latitude and longitude I'm comparing against):
SELECT * FROM users
WHERE SQRT(POW(69.1 * (latitude - $mylatitude),2) + POW(53.0 * (longitude - $mylongitude),2)) < 5
Is there a way I can incorporate these functions into a query using the SQLAlchemy ORM?
You can use literal SQL in your filter; see here: http://www.sqlalchemy.org/docs/05/ormtutorial.html?highlight=text#using-literal-sql
For example:
from sqlalchemy import text

clause = text("SQRT(POW(69.1 * (latitude - :lat),2) + POW(53.0 * (longitude - :long),2)) < 5")
query = meta.Session.query(User).filter(clause).params(lat=my_latitude, long=my_longitude)
I'd use the query builder interface and the func SQL function constructor to abstract the calculation as a function. This way you can use it freely with aliases or joins.
from sqlalchemy import func

User.coords = classmethod(lambda s: (s.latitude, s.longitude))

def calc_distance(latlong1, latlong2):
    return func.sqrt(func.pow(69.1 * (latlong1[0] - latlong2[0]), 2)
                     + func.pow(53.0 * (latlong1[1] - latlong2[1]), 2))

meta.Session.query(User).filter(calc_distance(User.coords(), (my_lat, my_long)) < 5)
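Since calc_distance returns an ordinary column expression, it can also be selected and sorted on; a small usage sketch with the names defined above:
# Fetch users together with their computed distance, nearest first.
dist = calc_distance(User.coords(), (my_lat, my_long)).label("distance")
nearest = meta.Session.query(User, dist).order_by(dist).all()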