I am pulling data from an online database using PostgreSQL queries and loading it into a pandas DataFrame in Python. I want to be able to change the dates in the SQL queries from one place in my Python script instead of manually editing every query one by one, as there are many queries with many lines each.
This is what I have to begin with for example:
random_query = """
select *
from table_A as a
where date_trunc('day',a.created_at) >= date('2022-03-01')
and date_trunc('day',a.created_at) <= date('2022-03-31')
group by 1,2,3
"""
Then I will read the data into Pandas as follows:
df_random_query = pd.read_sql(random_query, conn)
The connection above is to the database - the issue is not there so I am excluding that portion of code here.
What I have attempted is the following:
start_date = '2022-03-01'
end_date = '2022-03-31'
I have set the above 2 dates as variables and then below I have tried to use them in the SQL query as follows:
attempted_solution = """
select *
from table_A as a
where date_trunc('day',a.created_at) >= date(
""" + start_date + """)
and date_trunc('day',a.created_at) <= date(
""" + end_date + """)
group by 1,2,3
"""
This does run but it gives me a dataframe with no data in it - i.e. no numbers. I am not sure what I am doing wrong - any assistance will really help.
Try dropping the date() call and using an f-string, keeping the value inside single quotes:
my_query = f"... where date_trunc('day', a.created_at) >= '{start_date}'"
I was able to work it out as follows:
start_date = '2022-03-01'
end_date = '2022-03-31'
random_query = f"""
select *
from table_A as a
where date_trunc('day',a.created_at) >= date('{start_date}')
and date_trunc('day',a.created_at) <= date('{end_date}')
group by 1,2,3
"""
It was amusing to see that all I needed to do was wrap the interpolated start_date and end_date in single quotes as well. I noticed this simply by printing the query the script actually produced. The key thing here is knowing how to troubleshoot.
Another option is to keep {start_date} and {end_date} placeholders in a plain (non-f) string and call .format(start_date = '2022-03-01', end_date = '2022-03-31') at the end of the query.
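A minimal sketch of a third route, bound parameters, assuming the standard-library sqlite3 module as a stand-in for the real PostgreSQL connection (the sample rows and the fetch_between helper are made up for illustration). It first shows why the concatenated attempt ends up without quotes, then lets the driver bind the dates so no manual quoting is needed:

```python
import sqlite3

start_date = '2022-03-01'
end_date = '2022-03-31'

# The concatenated attempt renders the date without quotes.
concatenated = "date(" + start_date + ")"
# The f-string with quotes renders a proper string literal.
quoted = f"date('{start_date}')"
print(concatenated)  # date(2022-03-01)
print(quoted)        # date('2022-03-01')

# Assumption: an in-memory SQLite table stands in for table_A.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_A (id INTEGER, created_at TEXT)")
conn.executemany(
    "INSERT INTO table_A VALUES (?, ?)",
    [(1, '2022-02-28'), (2, '2022-03-01'), (3, '2022-03-15'), (4, '2022-04-01')],
)

def fetch_between(conn, start, end):
    """Hypothetical helper: the driver binds the dates, so no manual quoting."""
    query = """
        select id, created_at
        from table_A
        where date(created_at) >= date(:start_date)
          and date(created_at) <= date(:end_date)
        order by id
    """
    return conn.execute(query, {"start_date": start, "end_date": end}).fetchall()

rows = fetch_between(conn, start_date, end_date)
print(rows)
```

With pandas, the same dictionary can be handed over through the params argument of pd.read_sql. The placeholder syntax depends on the driver: psycopg2, for example, expects %(start_date)s rather than :start_date.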
Related
I have a function in my main Python file which gets called by main() and executes a SQL Merge (Upsert) statement using pyodbc from a different file & function. Concretely, the SQL statement traverses a source table with transaction details by distinct transaction datetimes and merges customers into a separate target table. The function that executes the statement and the function that returns the completed SQL statement are attached below.
When I run my Python script, it doesn't work as expected and inserts only around 70 rows (sometimes 69, 71, or 72) into the target customer table. However, when I use an identical SQL statement and execute it in the Microsoft SQL Server Management Studio console (attached below), it works fine and inserts 4302 rows (as expected).
I'm not sure what's wrong. I would really appreciate any help!
SQL Statement Executor in Python main file:
def stage_to_dim(connection, cursor, now):
    log(f"Filling {cfg.dim_customer} and {cfg.dim_product}")
    try:
        cursor.execute(sql_statements.stage_to_dim_statement(now))
        connection.commit()
    except Exception as e:
        log(f"Error in stage_to_dim: {e}")
        sys.exit(1)
    log("Stage2Dimensions complete.")
SQL Statement formulator in Python:
def stage_to_dim_statement(now):
    return f"""
DECLARE @dates table(id INT IDENTITY(1,1), date DATETIME)
INSERT INTO @dates (date)
SELECT DISTINCT TransactionDateTime FROM {cfg.stage_table} ORDER BY TransactionDateTime;
DECLARE @i INT;
DECLARE @cnt INT;
DECLARE @date DATETIME;
SELECT @i = MIN(id) - 1, @cnt = MAX(id) FROM @dates;
WHILE @i < @cnt
BEGIN
SET @i = @i + 1
SET @date = (SELECT date FROM @dates WHERE id = @i)
MERGE {cfg.dim_customer} AS Target
USING (SELECT * FROM {cfg.stage_table} WHERE TransactionDateTime = @date) AS Source
ON Target.CustomerCodeNK = Source.CustomerID
WHEN MATCHED THEN
UPDATE SET Target.AquiredDate = Source.AcquisitionDate, Target.AquiredSource = Source.AcquisitionSource,
Target.ZipCode = Source.Zipcode, Target.LoadDate = CONVERT(DATETIME, '{now}'), Target.LoadSource = '{cfg.ingest_file_path}'
WHEN NOT MATCHED THEN
INSERT (CustomerCodeNK, AquiredDate, AquiredSource, ZipCode, LoadDate, LoadSource) VALUES (Source.CustomerID,
Source.AcquisitionDate, Source.AcquisitionSource, Source.Zipcode, CONVERT(DATETIME,'{now}'), '{cfg.ingest_file_path}');
END
"""
SQL Statement from MS SQL Server Console:
DECLARE @dates table(id INT IDENTITY(1,1), date DATETIME)
INSERT INTO @dates (date)
SELECT DISTINCT TransactionDateTime FROM dbo.STG_CustomerTransactions ORDER BY TransactionDateTime;
DECLARE @i INT;
DECLARE @cnt INT;
DECLARE @date DATETIME;
SELECT @i = MIN(id) - 1, @cnt = MAX(id) FROM @dates;
WHILE @i < @cnt
BEGIN
SET @i = @i + 1
SET @date = (SELECT date FROM @dates WHERE id = @i)
MERGE dbo.DIM_CustomerDup AS Target
USING (SELECT * FROM dbo.STG_CustomerTransactions WHERE TransactionDateTime = @date) AS Source
ON Target.CustomerCodeNK = Source.CustomerID
WHEN MATCHED THEN
UPDATE SET Target.AquiredDate = Source.AcquisitionDate, Target.AquiredSource = Source.AcquisitionSource,
Target.ZipCode = Source.Zipcode, Target.LoadDate = CONVERT(DATETIME,'6/30/2022 11:53:05'), Target.LoadSource = '../csv/cleaned_original_data.csv'
WHEN NOT MATCHED THEN
INSERT (CustomerCodeNK, AquiredDate, AquiredSource, ZipCode, LoadDate, LoadSource) VALUES (Source.CustomerID, Source.AcquisitionDate,
Source.AcquisitionSource, Source.Zipcode, CONVERT(DATETIME,'6/30/2022 11:53:05'), '../csv/cleaned_original_data.csv');
END
If you think carefully about what the final result is, you are just taking the latest row (by date) for each customer. So you can filter the source in a single pass using a standard row-number approach instead of looping.
Exactly why the Python code didn't work properly is unclear, but the query below might work better. You are also interpolating values directly into the SQL text, which opens the door to SQL injection; that is dangerous and can also cause correctness problems.
Also, you should always use a non-ambiguous date format.
MERGE dbo.DIM_CustomerDup AS t
USING (
SELECT *
FROM (
SELECT *,
rn = ROW_NUMBER() OVER (PARTITION BY s.CustomerID ORDER BY s.TransactionDateTime DESC)
FROM dbo.STG_CustomerTransactions s
) AS s
WHERE s.rn = 1
) AS s
ON t.CustomerCodeNK = s.CustomerID
WHEN MATCHED THEN
UPDATE SET
AquiredDate = s.AcquisitionDate,
AquiredSource = s.AcquisitionSource,
ZipCode = s.Zipcode,
LoadDate = SYSDATETIME(),
LoadSource = '../csv/cleaned_original_data.csv'
WHEN NOT MATCHED THEN
INSERT (CustomerCodeNK, AquiredDate, AquiredSource, ZipCode, LoadDate, LoadSource)
VALUES (s.CustomerID, s.AcquisitionDate, s.AcquisitionSource, s.Zipcode, SYSDATETIME(), '../csv/cleaned_original_data.csv')
;
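To see why filtering to the latest row per customer gives the same end state as the date-by-date loop, here is a small pure-Python sketch. The staging rows and both function names are made up for illustration, and it assumes each customer has at most one row per timestamp:

```python
from datetime import datetime

# Made-up staging rows: (CustomerID, TransactionDateTime, Zipcode)
stage = [
    ("C1", datetime(2022, 6, 1, 9, 0), "10001"),
    ("C2", datetime(2022, 6, 1, 9, 0), "20002"),
    ("C1", datetime(2022, 6, 3, 14, 30), "10003"),
    ("C2", datetime(2022, 6, 2, 8, 15), "20004"),
]

def merge_date_by_date(rows):
    """Mimic the WHILE loop: upsert customers one transaction datetime at a time."""
    target = {}
    for when in sorted({r[1] for r in rows}):
        for customer, ts, zipcode in rows:
            if ts == when:
                target[customer] = zipcode  # insert or overwrite
    return target

def merge_latest_only(rows):
    """Mimic keeping only rn = 1 when ordering by TransactionDateTime DESC."""
    latest = {}
    for customer, ts, zipcode in rows:
        if customer not in latest or ts > latest[customer][0]:
            latest[customer] = (ts, zipcode)
    return {customer: zipcode for customer, (ts, zipcode) in latest.items()}

looped = merge_date_by_date(stage)
single_pass = merge_latest_only(stage)
print(looped == single_pass)  # True
```

Both approaches leave each customer with the values from their most recent transaction, which is what the ROW_NUMBER() filter expresses in one statement.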
I'm trying to replace a hardcoded date in my SQL query using '{Variable}' but I can't make it work.
Here is what I used to do (which works fine):
conn_drm = pyodbc.connect(
)
query_drm = '''
SELECT *
FROM
WHERE base_file_date = '2022-04-25'
'''
DF = pd.read_sql_query(query_drm, conn_drm)
Now, I would like to do something like this:
Yesterday = date.today() - timedelta(days=1)
Yesterday = str(Yesterday)
conn_drm = pyodbc.connect(
)
query_drm = '''
SELECT *
FROM
WHERE base_file_date = '{Yesterday}'
'''
DF = pd.read_sql_query(query_drm, conn_drm)
but I get the error:
Conversion failed when converting date and/or time from character string. (241)
Could someone please help?
You have to use a formatted string literal (f-string) to insert values like this; otherwise the text {Yesterday} is sent to the database as-is. To do so, add an f in front of the string:
query_drm = f'''
SELECT *
FROM
WHERE base_file_date = '{Yesterday}'
'''
However, please note that if any value inserted this way can come from user input, this method is dangerous, and you should prefer the parameterized approach described here: pyodbc - How to perform a select statement using a variable for a parameter
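A minimal sketch of both points, assuming the standard-library sqlite3 module as a stand-in for the pyodbc connection (the demo_files table and its rows are made up). It shows that a plain string keeps the braces untouched, and that a ? placeholder avoids string building altogether:

```python
import sqlite3
from datetime import date, timedelta

yesterday = str(date.today() - timedelta(days=1))

# Without the f prefix the braces are never substituted.
plain = "WHERE base_file_date = '{yesterday}'"
formatted = f"WHERE base_file_date = '{yesterday}'"
print(plain)      # still contains the text {yesterday}
print(formatted)  # contains the actual date

# Assumption: an in-memory SQLite table stands in for the real source table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_files (name TEXT, base_file_date TEXT)")
conn.executemany(
    "INSERT INTO demo_files VALUES (?, ?)",
    [("old.csv", "2022-04-25"), ("new.csv", yesterday)],
)

# The driver binds the value; no quotes or braces in the SQL text.
query = "SELECT name FROM demo_files WHERE base_file_date = ?"
rows = conn.execute(query, (yesterday,)).fetchall()
print(rows)
```

pyodbc uses the same ? placeholder style, so with pandas the call would look roughly like pd.read_sql_query(query, conn_drm, params=[Yesterday]).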
I'm trying to query a database using Python/Pandas. This will be a recurring request where I'd like to look back into a window of time that changes over time, so I'd like to use some smarts in how I do this.
In my SQLite query, if I say
WHERE table.date BETWEEN DATETIME('now', '-6 month') AND DATETIME('now')
I get the result I expect. But if I try to move those to variables, the resulting table comes up empty. I found out that the endDate variable does work but the startDate does not. Presumably I'm doing something wrong with the escapes around the apostrophes? Since the result is coming up empty it's like it's looking at DATETIME(\'now\') and not seeing the '-6 month' bit (comparing now vs. now which would be empty). Any ideas how I can pass this through to the query correctly using Python?
startDate = 'DATETIME(\'now\', \'-6 month\')'
endDate = 'DATETIME(\'now\')'
query = '''
SELECT some stuff
FROM table
WHERE table.date BETWEEN ? AND ?
'''
df = pd.read_sql_query(query, db, params=[startDate, endDate])
You can try with the string format as shown below,
startDate = "DATETIME('now', '-6 month')"
endDate = "DATETIME('now')"
query = '''
SELECT some stuff
FROM table
WHERE table.date BETWEEN {start_date} AND {end_data}
'''
df = pd.read_sql_query(query.format(start_date=startDate, end_data=endDate), db)
When you provide parameters to a query, they're treated as literals, not expressions that SQL should evaluate.
You can pass the function arguments rather than the function as a string.
startDate = 'now'
startOffset = '-6 month'
endDate = 'now'
endOffset = '+0 seconds'
query = '''
SELECT some stuff
FROM table
WHERE table.date BETWEEN DATETIME(?, ?) AND DATETIME(?, ?)
'''
df = pd.read_sql_query(query, db, params=[startDate, startOffset, endDate, endOffset])
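The difference between a bound literal and an evaluated expression can be checked directly. A small sketch using an in-memory SQLite database (no table is needed for the demonstration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A bound parameter is a value: the text comes back unchanged, nothing is evaluated.
as_literal = conn.execute(
    "SELECT ?", ("DATETIME('now', '-6 month')",)
).fetchone()[0]
print(as_literal)

# Passing only the arguments lets SQLite evaluate DATETIME itself.
start, end = conn.execute(
    "SELECT DATETIME(?, ?), DATETIME(?, ?)",
    ("now", "-6 month", "now", "+0 seconds"),
).fetchone()
print(start, end)
```

In the first query the date column would be compared against the string "DATETIME('now', '-6 month')" rather than a timestamp, which is consistent with the empty result described in the question.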
I have the following query in which I'm trying to pass start and end dates into a SQL query.
def get_data(start_date,end_date):
    ic = Connector()
    q = f"""
    select * from example_table a
    where a.date between {start_date} and {end_date}
    """
    result = ic.query(q)
    return result
df = pd.DataFrame(get_data('2021-01-01','2021-01-31'))
print(df)
which leads to the following error:
AnalysisException: Incompatible return types 'STRING' and 'BIGINT' of exprs 'a.date' and '2021 - 1 - 1'.\n (110) (SQLExecDirectW)")
I have also tried to parse the dates as follows:
import datetime
start_date = datetime.date(2021,1,1)
end_date = datetime.date(2021,5,13)
df = pd.DataFrame(get_data(start_date,end_date))
but I still get the same error.
Any help will be much appreciated.
It seems to me that, because of how you inject your values into the SQL query, they are not recognized as date values. The database will most likely interpret 2021-01-01 as an arithmetic expression (2021 - 1 - 1), with 2019 being the result.
You should try putting single quotes around your values:
q = f"""
select * from example_table a
where a.date between '{start_date}' and '{end_date}'
"""
Or, preferably, if your DB library allows it, don't inject the values directly but pass them as parameters:
q = """
select * from example_table a
where a.date between %s and %s
"""
result = ic.query(q, (start_date, end_date))
EDIT: Some database libraries use a placeholder format other than %s. You should consult the documentation of the DB library you are using.
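The arithmetic interpretation is easy to reproduce. A small sketch using the standard-library sqlite3 module as a stand-in (the engine in the question is a different one, so this only illustrates the general behaviour):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Unquoted, the "date" is parsed as the subtraction 2021 - 1 - 1.
unquoted = conn.execute("SELECT 2021-01-01").fetchone()[0]
# Quoted, it stays a string that the database can treat as a date.
quoted = conn.execute("SELECT '2021-01-01'").fetchone()[0]
# Bound as a parameter, no quoting is needed in the SQL text at all.
bound = conn.execute("SELECT ?", ("2021-01-01",)).fetchone()[0]
print(unquoted, quoted, bound)

# DB-API modules advertise their placeholder style; sqlite3 uses "qmark" (?).
print(sqlite3.paramstyle)
```

Most DB-API drivers expose the same paramstyle attribute, which is a quick way to find out whether ?, %s, or a named form is expected.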
I have a booking system and I save the booked daterange in a DATERANGE column:
booked_date = Column(DATERANGE(), nullable=False)
I already know that I can access the actual dates with booked_date.lower or booked_date.upper
For example I do this here:
for bdate in room.RoomObject_addresses_UserBooksRoom:
    unaviable_ranges['ranges'].append([str(bdate.booked_date.lower),\
        str(bdate.booked_date.upper)])
Now I need to filter my bookings by a given daterange. For example I want to see all bookings between 01.01.2018 and 10.01.2018.
Usually it's simple, because dates can be compared like this: date <= other date
But if I do it with the DATERANGE:
the_daterange_lower = datetime.strptime(the_daterange[0], '%d.%m.%Y')
the_daterange_upper = datetime.strptime(the_daterange[1], '%d.%m.%Y')
bookings = UserBooks.query.filter(UserBooks.booked_date.lower >= the_daterange_lower,\
UserBooks.booked_date.upper <= the_daterange_upper).all()
I get an error:
AttributeError: Neither 'InstrumentedAttribute' object nor 'Comparator' object associated with UserBooks.booked_date has an attribute 'lower'
EDIT
I found a sheet with useful range operators, and it looks like there are better options for what I want to do, but for this I somehow need to create a range variable, and I don't see how to do that in Python. So I am still confused.
In my database my daterange column entries look like this:
[2018-11-26,2018-11-28)
EDIT
I am trying to use native SQL and not SQLAlchemy, but I don't understand how to create a daterange object.
bookings = db_session.execute('SELECT * FROM usersbookrooms WHERE booked_date && [' + str(the_daterange_lower) + ',' + str(the_daterange_upper) + ')')
The query
the_daterange_lower = datetime.strptime(the_daterange[0], '%d.%m.%Y')
the_daterange_upper = datetime.strptime(the_daterange[1], '%d.%m.%Y')
bookings = UserBooks.query.\
filter(UserBooks.booked_date.lower >= the_daterange_lower,
UserBooks.booked_date.upper <= the_daterange_upper).\
all()
could be implemented using the "range is contained by" operator <@. In order to pass the right operand you have to create an instance of psycopg2.extras.DateRange, which represents a PostgreSQL daterange value in Python:
from psycopg2.extras import DateRange
the_daterange_lower = datetime.strptime(the_daterange[0], '%d.%m.%Y').date()
the_daterange_upper = datetime.strptime(the_daterange[1], '%d.%m.%Y').date()
the_daterange = DateRange(the_daterange_lower, the_daterange_upper)
bookings = UserBooks.query.\
filter(UserBooks.booked_date.contained_by(the_daterange)).\
all()
Note that the attributes lower and upper are part of the psycopg2.extras.Range types. The SQLAlchemy range column types do not provide such, as your error states.
If you want to use raw SQL and pass date ranges, you can use the same DateRange objects to pass values as well:
bookings = db_session.execute(
'SELECT * FROM usersbookrooms WHERE booked_date && %s',
(DateRange(the_daterange_lower, the_daterange_upper),))
You can also build literals manually, if you want to:
bookings = db_session.execute(
'SELECT * FROM usersbookrooms WHERE booked_date && %s::daterange',
(f'[{the_daterange_lower}, {the_daterange_upper})',))
The trick is to build the literal in Python and pass it as a single value, using placeholders as always. This should avoid any SQL injection possibilities; the only thing that can go wrong is that the literal has invalid syntax for a daterange. Alternatively, you can pass the bounds to a range constructor:
bookings = db_session.execute(
'SELECT * FROM usersbookrooms WHERE booked_date && daterange(%s, %s)',
(the_daterange_lower, the_daterange_upper))
All in all it is easier to just use the Psycopg2 Range types and let them handle the details.
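For intuition about what the "range is contained by" (<@) and overlap (&&) operators test, here is a pure-Python sketch of the half-open [lower, upper) semantics. All four helper functions are hypothetical illustrations, not part of psycopg2 or SQLAlchemy, and they assume non-empty ranges with both bounds present:

```python
from datetime import date, datetime

def parse(text):
    """Parse the dd.mm.YYYY strings used above into date objects."""
    return datetime.strptime(text, "%d.%m.%Y").date()

def overlaps(a_lower, a_upper, b_lower, b_upper):
    """Python analogue of && for half-open [lower, upper) ranges."""
    return a_lower < b_upper and b_lower < a_upper

def contained_by(a_lower, a_upper, b_lower, b_upper):
    """Python analogue of <@: range a lies entirely inside range b."""
    return b_lower <= a_lower and a_upper <= b_upper

def range_literal(lower, upper):
    """Build the text form of a half-open daterange literal."""
    return f"[{lower}, {upper})"

wanted = (parse("01.01.2018"), parse("10.01.2018"))
booking = (date(2018, 1, 8), date(2018, 1, 12))

print(overlaps(*booking, *wanted))      # True
print(contained_by(*booking, *wanted))  # False
print(range_literal(*wanted))           # [2018-01-01, 2018-01-10)
```

A booking that starts inside the requested window but ends after it overlaps the window without being contained by it, so the choice between the two operators changes which bookings are returned.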