Inputting dates from python to a database - python

I have the following query in which i'm trying to pass start dates and end dates in a sql query.
def get_data(start_date,end_date):
ic = Connector()
q = f"""
select * from example_table a
where a.date between {start_date} and {end_date}
"""
result = ic.query(q)
return result
df = pd.DataFrame(get_data('2021-01-01','2021-01-31'))
print(df)
which leads to the following error:
AnalysisException: Incompatible return types 'STRING' and 'BIGINT' of exprs 'a.date' and '2021 - 1 - 1'.\n (110) (SQLExecDirectW)")
I have also tried to parse the dates as follows:
import datetime
start_date = datetime.date(2021,1,1)
end_date = datetime.date(2021,5,13)
df = pd.DataFrame(get_data(start_date,end_date))
but i still get the same error.
Any help will be much appreciated.

It seems to me, that it is because how you inject your values into sql query they don't get recognized as date values. Database will likely 2021-01-01 interpret as mathematical expression with 2019 being the result.
You should try put parentheses around your values.
q = f"""
select * from example_table a
where a.date between '{start_date}' and '{end_date}'
"""
Or preferably if your db library allows it don't inject your values directly
q = """
select * from example_table a
where a.date between %s and %s
"""
result = ic.query(q, (start_date, end_date))
EDIT: Some database libraries may use place-holder with different format than %s. You should probably consult documentation of db library you are using.

Related

Why does identical code for SQL Merge (Upsert) work in Microsoft SQL Server console but doesn't work in Python?

I have a function in my main Python file which gets called by main() and executes a SQL Merge (Upsert) statement using pyodbc from a different file & function. Concretely, the SQL statement traverses a source table with transaction details by distinct transaction datetimes and merges customers into a separate target table. The function that executes the statement and the function that returns the completed SQL statement are attached below.
When I run my Python script, it doesn't work as expected and inserts only around 70 rows (sometimes 69, 71, or 72) into the target customer table. However, when I use an identical SQL statement and execute it in the Microsoft SQL Server Management Studio console (attached below), it works fine and inserts 4302 rows (as expected).
I'm not sure what's wrong.. Would really appreciate any help!
SQL Statement Executor in Python main file:
def stage_to_dim(connection, cursor, now):
log(f"Filling {cfg.dim_customer} and {cfg.dim_product}")
try:
cursor.execute(sql_statements.stage_to_dim_statement(now))
connection.commit()
except Exception as e:
log(f"Error in stage_to_dim: {e}" )
sys.exit(1)
log("Stage2Dimensions complete.")
SQL Statement formulator in Python:
def stage_to_dim_statement(now):
return f"""
DECLARE #dates table(id INT IDENTITY(1,1), date DATETIME)
INSERT INTO #dates (date)
SELECT DISTINCT TransactionDateTime FROM {cfg.stage_table} ORDER BY TransactionDateTime;
DECLARE #i INT;
DECLARE #cnt INT;
DECLARE #date DATETIME;
SELECT #i = MIN(id) - 1, #cnt = MAX(id) FROM #dates;
WHILE #i < #cnt
BEGIN
SET #i = #i + 1
SET #date = (SELECT date FROM #dates WHERE id = #i)
MERGE {cfg.dim_customer} AS Target
USING (SELECT * FROM {cfg.stage_table} WHERE TransactionDateTime = #date) AS Source
ON Target.CustomerCodeNK = Source.CustomerID
WHEN MATCHED THEN
UPDATE SET Target.AquiredDate = Source.AcquisitionDate, Target.AquiredSource = Source.AcquisitionSource,
Target.ZipCode = Source.Zipcode, Target.LoadDate = CONVERT(DATETIME, '{now}'), Target.LoadSource = '{cfg.ingest_file_path}'
WHEN NOT MATCHED THEN
INSERT (CustomerCodeNK, AquiredDate, AquiredSource, ZipCode, LoadDate, LoadSource) VALUES (Source.CustomerID,
Source.AcquisitionDate, Source.AcquisitionSource, Source.Zipcode, CONVERT(DATETIME,'{now}'), '{cfg.ingest_file_path}');
END
"""
SQL Statement from MS SQL Server Console:
DECLARE #dates table(id INT IDENTITY(1,1), date DATETIME)
INSERT INTO #dates (date)
SELECT DISTINCT TransactionDateTime FROM dbo.STG_CustomerTransactions ORDER BY TransactionDateTime;
DECLARE #i INT;
DECLARE #cnt INT;
DECLARE #date DATETIME;
SELECT #i = MIN(id) - 1, #cnt = MAX(id) FROM #dates;
WHILE #i < #cnt
BEGIN
SET #i = #i + 1
SET #date = (SELECT date FROM #dates WHERE id = #i)
MERGE dbo.DIM_CustomerDup AS Target
USING (SELECT * FROM dbo.STG_CustomerTransactions WHERE TransactionDateTime = #date) AS Source
ON Target.CustomerCodeNK = Source.CustomerID
WHEN MATCHED THEN
UPDATE SET Target.AquiredDate = Source.AcquisitionDate, Target.AquiredSource = Source.AcquisitionSource,
Target.ZipCode = Source.Zipcode, Target.LoadDate = CONVERT(DATETIME,'6/30/2022 11:53:05'), Target.LoadSource = '../csv/cleaned_original_data.csv'
WHEN NOT MATCHED THEN
INSERT (CustomerCodeNK, AquiredDate, AquiredSource, ZipCode, LoadDate, LoadSource) VALUES (Source.CustomerID, Source.AcquisitionDate,
Source.AcquisitionSource, Source.Zipcode, CONVERT(DATETIME,'6/30/2022 11:53:05'), '../csv/cleaned_original_data.csv');
END
If you think carefully about what your final result ends up, you are actually just taking the latest row (by date) for each customer. So you can just filter the source using a standard row-number approach.
Exactly why the Python code didn't work properly is unclear, but the below query might work better. You are also doing SQL injection, which is dangerous and can also cause correctness problems.
Also you should always use a non-ambiguous date format.
MERGE dbo.DIM_CustomerDup AS t
USING (
SELECT *
FROM (
SELECT *,
rn = ROW_NUMBER() OVER (PARTITION BY s.CustomerID ORDER BY s.TransactionDateTime DESC)
FROM dbo.STG_CustomerTransactions s
) AS s
WHERE s.rn = 1
) AS s
ON t.CustomerCodeNK = s.CustomerID
WHEN MATCHED THEN
UPDATE SET
AquiredDate = s.AcquisitionDate,
AquiredSource = s.AcquisitionSource,
ZipCode = s.Zipcode,
LoadDate = SYSDATETIME(),
LoadSource = '../csv/cleaned_original_data.csv'
WHEN NOT MATCHED THEN
INSERT (CustomerCodeNK, AquiredDate, AquiredSource, ZipCode, LoadDate, LoadSource)
VALUES (s.CustomerID, s.AcquisitionDate, s.AcquisitionSource, s.Zipcode, SYSDATETIME(), '../csv/cleaned_original_data.csv')
;

Combining Python variables into SQL queries

I am pulling data from an online database using SQL/postgresql queries and converting it into a Python dataframe using Pandas. I want to be able to change the dates in the SQL query from one point in my Python script instead of having to manually go through every SQL query and change it one by one as there are many queries and many lines in each one.
This is what I have to begin with for example:
random_query = """
select *
from table_A as a
where date_trunc('day',a.created_at) >= date('2022-03-01')
and date_trunc('day',a.created_at) <= date('2022-03-31')
group by 1,2,3
"""
Then I will read the data into Pandas as follows:
df_random_query = pd.read_sql(random_query, conn)
The connection above is to the database - the issue is not there so I am excluding that portion of code here.
What I have attempted is the following:
start_date = '2022-03-01'
end_date = '2022-03-31'
I have set the above 2 dates as variables and then below I have tried to use them in the SQL query as follows:
attempted_solution = """
select *
from table_A as a
where date_trunc('day',a.created_at) >= date(
""" + start_date + """)
and date_trunc('day',a.created_at) <= date(
""" + end_date + """)
group by 1,2,3
"""
This does run but it gives me a dataframe with no data in it - i.e. no numbers. I am not sure what I am doing wrong - any assistance will really help.
try dropping date function and formatting:
my_query = f"... where date_trunc('day', a.created_at) >= {start_date}"
I was able to work it out as follows:
start_date = '2022-03-01'
end_date = '2022-03-31'
random_query = f"""
select *
from table_A as a
where date_trunc('day',a.created_at) >= date('start_date')
and date_trunc('day',a.created_at) <= date('end_date')
group by 1,2,3
"""
It was amusing to see that all I needed to do was put start_date and end_date in ' ' as well. I noticed this simply by printing what query was showing in the script. Key thing here is to know how to troubleshoot.
Another option was also to use the .format() at the end of the query and inside it say .format(start_date = '2022-03-01', end_date = '2022-03-31').

SQLite query in Python using DATETIME and variables not working as expected

I'm trying to query a database using Python/Pandas. This will be a recurring request where I'd like to look back into a window of time that changes over time, so I'd like to use some smarts in how I do this.
In my SQLite query, if I say
WHERE table.date BETWEEN DATETIME('now', '-6 month') AND DATETIME('now')
I get the result I expect. But if I try to move those to variables, the resulting table comes up empty. I found out that the endDate variable does work but the startDate does not. Presumably I'm doing something wrong with the escapes around the apostrophes? Since the result is coming up empty it's like it's looking at DATETIME(\'now\') and not seeing the '-6 month' bit (comparing now vs. now which would be empty). Any ideas how I can pass this through to the query correctly using Python?
startDate = 'DATETIME(\'now\', \'-6 month\')'
endDate = 'DATETIME(\'now\')'
query = '''
SELECT some stuff
FROM table
WHERE table.date BETWEEN ? AND ?
'''
df = pd.read_sql_query(query, db, params=[startDate, endDate])
You can try with the string format as shown below,
startDate = "DATETIME('now', '-6 month')"
endDate = "DATETIME('now')"
query = '''
SELECT some stuff
FROM table
WHERE table.date BETWEEN {start_date} AND {end_data}
'''
df = pd.read_sql_query(query.format(start_date=startDate, end_data=endDate), db)
When you provide parameters to a query, they're treated as literals, not expressions that SQL should evaluate.
You can pass the function arguments rather than the function as a string.
startDate = 'now'
startOffset = '-6 month'
endDate = 'now'
endOffset = '+0 seconds'
query = '''
SELECT some stuff
FROM table
WHERE table.date BETWEEN DATETIME(?, ?) AND DATETIME(?, ?)
'''
df = pd.read_sql_query(query, db, params=[startDate, startOffset, endDate, endOffset])

Invalid Syntax (near date)

i found this code in stackoverflow. suits my requirements. i'm using an mysql db that consists of id, date, gender, name and surname. i have made few changes to the code so that i can obtain the date and other info between the start date and the end date.
import mysql
import mysql.connector
import command
import datetime
def findarbs(startdate, enddate):
con = mysql.connector.connect(user='root', password='root',
host='localhost',port='3306',database='testdb')
cur = con.cursor()
command = ("SELECT DISTINCT id, date FROM babyrop", WHERE 'date1'
> ('%s') AND 'date2' < ('%s'), ORDER BY date asc % (startdate,
enddate))
print ('command')
cursor.execute(command)
result = cur.fetchone()
print (result)
while result is not None:
print (result[1])
result = cursor.fetchone()
cur = con.cursor()
cur.execute('select id, date weight from babyrop')
data = cursor.fetchall()
cur.close()
con.close()
when i tried executing it i'm getting an error "invalid syntax" near date. let me know what corrections to are required.
You are terminating the string way too early. You also have invalid commas in the query.
command = ("SELECT DISTINCT id, date FROM babyrop", WHERE 'date1'
> ('%s') AND 'date2' < ('%s'), ORDER BY date asc % (startdate,
enddate))
"should" be
command = ("SELECT DISTINCT id, date FROM babyrop WHERE 'date1'
> ('%s') AND 'date2' < ('%s') ORDER BY date asc" % (startdate,
enddate))
I wrote "should" because you should not use string concatenation in queries since it exposes your code to SQL injection. You should use parametrized queries instead. It also recommended to use triple quotes for query strings so they can be broken apart nicely:
command = """SELECT DISTINCT id, date
FROM babyrop
WHERE 'date1' > ('%s') AND 'date2' < ('%s')
ORDER BY date asc"""
cursor.execute(command, (startdate, enddate))
Also, I'm pretty sure that you don't need the single quotes (') around the columns names and the placeholders in the where clause but that might be a MySQL thing.

PyMySql Changing date ranges in SQL query

I want to look at data for the past 7 days, so have generated the values:
from_date = str(date.today() - timedelta(7))
to_date = str(date.today()
This is so I can do a query using PyMySql such as:
data = """SELECT * FROM table WHERE date > 'from_date' AND date < 'to_date'"""
db_conn = pymysql.connect(host=xxx, user=xxx, password=xxx)
df = pd.read_sql(data, con=db_conn)
This doesn't work, and I've tried different quotation marks around from_date and to_date to try and do this. What is the best way to refer to this in PyMySql?
You need to pass variable values into the query through the query parameters:
data = """SELECT * FROM table WHERE date > %s AND date < %s"""
df = pd.read_sql(data, con=db_conn, params=[from_date, to_date])
Here %s in the query are positional placeholders.
You might also need to format the dates:
df = pd.read_sql(data, con=db_conn,
params=[from_date.strftime('%Y-%m-%d'),
to_date.strftime('%Y-%m-%d')])

Categories

Resources