Using a table in python UDF in Redshift - python

I need to create a Python UDF (user-defined function) in Redshift that will be called from another procedure. The UDF takes two date values, iterates over the dates between the given start and end date, and checks each intermediate date for membership in a list.
That list needs to be populated from another table's column. The problem is that Python UDFs are written in the plpythonu language, which doesn't recognize any SQL. How can I build this list from the table's column?
This is my function:
create or replace function test_tmp (ending_date date, starting_date date)
returns integer
stable
as $$
from datetime import timedelta

def get_working_days(ending_date, starting_date):
    days = 0
    if starting_date is not None and ending_date is not None:
        for n in range(int((ending_date - starting_date).days)):
            btw_date = (starting_date + timedelta(n)).strftime('%Y-%m-%d')
            if btw_date in date_list:
                days = days + 1
        return days
    return 0

return get_working_days(ending_date, starting_date)
$$ language plpythonu;
Now, I need to create this date_list as something like:
date_list = [str(each["WORK_DATE"]) for each in (select WORK_DATE from public.date_list_table).collect()]
But, using this line in the function obviously gives an error, as select WORK_DATE from public.date_list_table is SQL.
Following is the structure of table public.date_list_table:
CREATE TABLE public.date_list
(
work_date date ENCODE az64
)
DISTSTYLE EVEN;
Some sample values for this table (actually this table stores only the working days values for the entire year):
insert into date_list_table values ('2021-07-01'),('2021-06-30'),('2021-06-29');

An Amazon Redshift scalar UDF cannot access any tables. It needs to be self-contained, with all the necessary information passed into the function. Alternatively, you could store the date information inside the function so it doesn't need to access the table (not unreasonable, since it only needs to hold exceptions such as public holidays falling on weekdays).
It appears that your use-case is calculating the number of working days between two dates. One way this is traditionally solved is to create a calendar table with one row per day and columns providing information such as:
Work Day (Boolean)
Weekend (Boolean)
Public Holiday (Boolean)
Month
Quarter
Day of Year
etc.
You can then JOIN or query the table to identify the desired information, such as:
SELECT COUNT(*) FROM calendar WHERE work_day AND date BETWEEN start_date AND end_date
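If the working-day list is stored inside the function instead, the counting logic is plain Python. A minimal sketch using the sample dates from the question (the hard-coded set stands in for the table lookup):

```python
from datetime import date, timedelta

# working days embedded in the function, as suggested above;
# these three dates are the sample rows from the question
WORKING_DAYS = {date(2021, 6, 29), date(2021, 6, 30), date(2021, 7, 1)}

def get_working_days(ending_date, starting_date):
    """Count working days in [starting_date, ending_date)."""
    if starting_date is None or ending_date is None:
        return 0
    return sum(
        1
        for n in range((ending_date - starting_date).days)
        if starting_date + timedelta(days=n) in WORKING_DAYS
    )

count = get_working_days(date(2021, 7, 2), date(2021, 6, 28))  # 3
```

The same body can be pasted into the `$$ ... $$` section of a plpythonu UDF, with the set regenerated whenever the calendar changes.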

Related

Pandas/SQL only return results where groupby has at least n rows

I have a stock database of the following form where data is given by the minute.
I want to fetch all results which have at least N rows for a particular ticker, for a particular date. (when grouped by ticker and date)
The purpose is to ignore missing data for a given ticker on a given date. There should be at least 629 rows for any given ticker,date combination. Anything less than this, I want to ignore that combination.
I tried something like the following but it wasn't working
start_date = '2021-11-08'
SQL = "SELECT datetime, datetime_utc, ticker,open,close,session_type, high,low,volume FROM candles WHERE id in (SELECT id from candles WHERE session_type='Live' GROUP BY datetime, ticker HAVING COUNT(ID) >= 629) AND datetime >= '{} 00:00:00'::timestamp ORDER BY ticker, datetime_utc ASC".format(start_date)
Could anyone point me at the right SQL incantation? I am using postgres > 9.6.
sample data (apologies, I tried to set up an SQL fiddle but it wasn't allowing me to due to character limitations, then data type limitations; the data is below in JSON, set to be read by pandas)
df = pd.io.json.read_json('{"datetime":{"12372":1636572720000,"12351":1636571460000,"12493":1636590240000,"15210":1636633380000,"16212":1636642500000,"16009":1636560060000,"12213":1636554180000,"16386":1636657800000,"13580":1636572660000,"14733":1636659000000,"12086":1636537200000,"14802":1636667880000,"14407":1636585980000,"14356":1636577640000,"16086":1636574280000,"15437":1636661040000,"14115":1636577400000,"14331":1636576140000,"12457":1636582800000,"14871":1636678080000},"datetime_utc":{"12372":1636572720000,"12351":1636571460000,"12493":1636590240000,"15210":1636633380000,"16212":1636642500000,"16009":1636560060000,"12213":1636554180000,"16386":1636657800000,"13580":1636572660000,"14733":1636659000000,"12086":1636537200000,"14802":1636667880000,"14407":1636585980000,"14356":1636577640000,"16086":1636574280000,"15437":1636661040000,"14115":1636577400000,"14331":1636576140000,"12457":1636582800000,"14871":1636678080000},"ticker":{"12372":"AAPL","12351":"AAPL","12493":"AAPL","15210":"AAPL","16212":"NET","16009":"NET","12213":"AAPL","16386":"NET","13580":"NET","14733":"AAPL","12086":"AAPL","14802":"AAPL","14407":"AAPL","14356":"AAPL","16086":"NET","15437":"AAPL","14115":"NET","14331":"AAPL","12457":"AAPL","14871":"AAPL"},"open":{"12372":148.29,"12351":148.44,"12493":148.1,"15210":148.68,"16212":199.72,"16009":202.27,"12213":150.21,"16386":198.65,"13580":194.9,"14733":147.94,"12086":150.2,"14802":147.87,"14407":148.1,"14356":148.09,"16086":193.82,"15437":148.01,"14115":194.64,"14331":148.07,"12457":148.12,"14871":148.2},"close":{"12372":148.32,"12351":148.44,"12493":148.15,"15210":148.69,"16212":199.32,"16009":202.52,"12213":150.25,"16386":198.57,"13580":194.96,"14733":147.99,"12086":150.17,"14802":147.9,"14407":148.1,"14356":147.99,"16086":194.43,"15437":148.01,"14115":194.78,"14331":148.05,"12457":148.11,"14871":148.28},"session_type":{"12372":"Live","12351":"Live","12493":"Post","15210":"Pre","16212":"Live","16009":"Live","12213":"Pre","16386":"Live","13580":"Li
ve","14733":"Live","12086":"Pre","14802":"Post","14407":"Post","14356":"Live","16086":"Live","15437":"Live","14115":"Live","14331":"Live","12457":"Post","14871":"Post"},"high":{"12372":148.3600006104,"12351":148.4700012207,"12493":148.15,"15210":148.69,"16212":199.8399963379,"16009":202.5249938965,"12213":150.25,"16386":198.6499938965,"13580":195.0299987793,"14733":147.9900054932,"12086":150.2,"14802":147.9,"14407":148.1,"14356":148.1049957275,"16086":194.4400024414,"15437":148.0399932861,"14115":195.1699981689,"14331":148.0800018311,"12457":148.12,"14871":148.28},"low":{"12372":148.26,"12351":148.38,"12493":148.06,"15210":148.68,"16212":199.15,"16009":202.27,"12213":150.2,"16386":198.49,"13580":194.79,"14733":147.93,"12086":150.16,"14802":147.85,"14407":148.1,"14356":147.98,"16086":193.82,"15437":148.0,"14115":194.64,"14331":148.01,"12457":148.07,"14871":148.2},"volume":{"12372":99551.0,"12351":68985.0,"12493":0.0,"15210":0.0,"16212":9016.0,"16009":2974.0,"12213":0.0,"16386":1395.0,"13580":5943.0,"14733":59854.0,"12086":0.0,"14802":0.0,"14407":0.0,"14356":341196.0,"16086":8715.0,"15437":45495.0,"14115":16535.0,"14331":173785.0,"12457":0.0,"14871":0.0},"date":{"12372":1636502400000,"12351":1636502400000,"12493":1636502400000,"15210":1636588800000,"16212":1636588800000,"16009":1636502400000,"12213":1636502400000,"16386":1636588800000,"13580":1636502400000,"14733":1636588800000,"12086":1636502400000,"14802":1636588800000,"14407":1636502400000,"14356":1636502400000,"16086":1636502400000,"15437":1636588800000,"14115":1636502400000,"14331":1636502400000,"12457":1636502400000,"14871":1636588800000}}')
Expected result
if N = 4, then the results will exclude NET on 2021-11-11. (count of all shown here for illustration)
Like below (sample only)
You can use an analytic (window) function to count the rows for each date and ticker combination:
SELECT datetime, datetime_utc, ticker, open, close, session_type, high, low, volume
FROM (
    SELECT datetime, datetime_utc, ticker, open, close, session_type, high, low, volume,
           COUNT(*) OVER (PARTITION BY CAST(datetime AS DATE), ticker) AS c
    FROM candles
) t
WHERE c >= 5
I am not familiar with Postgres, so this may contain a syntax problem (note that Postgres does require the subquery alias, t here).
You can use pandasql with a nested group by to count the daily data points for all tickers, then join the counts back to the original frame:
import pandasql as ps

N = 4
df_select = ps.sqldf(f"""
    select df.*
    from (select count(*) as count, ticker, date
          from df
          group by date, ticker
          having count >= {N}) df_counts
    left join df
        on df.ticker = df_counts.ticker and df.date = df_counts.date
""")
Output:
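For completeness, the same filter can be expressed in pandas alone with `groupby(...).filter(...)`, avoiding the SQL round-trip. A sketch with made-up rows (real data has one row per minute):

```python
import pandas as pd

# tiny stand-in frame: AAPL has 4 rows on its date, NET only 2
df = pd.DataFrame({
    "ticker": ["AAPL"] * 4 + ["NET"] * 2,
    "date":   ["2021-11-10"] * 4 + ["2021-11-11"] * 2,
    "close":  [148.3, 148.4, 148.1, 148.7, 199.3, 198.6],
})

N = 4
# keep only (ticker, date) groups with at least N rows
df_select = df.groupby(["ticker", "date"]).filter(lambda g: len(g) >= N)
```

With N = 4 this keeps the four AAPL rows and drops NET, matching the expected result above.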

How to get the difference of two values of same column in django

So I have a table T in the Postgres database which holds day-to-day values for users. From that table, I need to find the difference between yesterday's and today's value of an attribute C for all users, and return the IDs of the users for which the difference is greater than 10.
Can this whole thing be done in one or two queries? The solutions I have in my mind are naive and will need separate queries for each user. Anything more efficient than this will be also great.
Without the schema for table T this is just a shot in the dark. Here it goes:
WITH y_date AS (
    SELECT user_id, c
    FROM t
    WHERE timestamp_fld::date = current_date - interval '1 day'
), t_date AS (
    SELECT user_id, c
    FROM t
    WHERE timestamp_fld::date = current_date
)
SELECT y_date.user_id
FROM y_date
JOIN t_date ON y_date.user_id = t_date.user_id
WHERE abs(t_date.c - y_date.c) > 10;
This uses CTEs (Common Table Expressions) to pull the values for yesterday and today, then joins those results to get the absolute difference for each user, returning only those greater than 10. The above is based on a bunch of assumptions; a better answer will depend on more information.
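The CTE-plus-join pattern can be exercised end to end with SQLite as a stand-in for Postgres (the table and column names here follow the assumptions above; SQLite's date() replaces the ::date cast):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (user_id INTEGER, c REAL, timestamp_fld TEXT);
    INSERT INTO t VALUES
        (1, 100, date('now', '-1 day')), (1, 150, date('now')),
        (2, 100, date('now', '-1 day')), (2, 105, date('now'));
""")

rows = conn.execute("""
    WITH y_date AS (
        SELECT user_id, c FROM t
        WHERE date(timestamp_fld) = date('now', '-1 day')
    ), t_date AS (
        SELECT user_id, c FROM t
        WHERE date(timestamp_fld) = date('now')
    )
    SELECT y_date.user_id
    FROM y_date JOIN t_date ON y_date.user_id = t_date.user_id
    WHERE abs(t_date.c - y_date.c) > 10
""").fetchall()
# user 1 changed by 50, user 2 by only 5, so only user 1 is returned
```

In Django this query is probably easiest to run via Manager.raw() or a database cursor, since the self-join on dates doesn't map cleanly onto the ORM.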

Filtering and extracting specific dates from SQLite file using python(Anaconda)

I'm trying to filter and extract a specific date with the month 2 inside my SQLite database using python and calculating their average monthly prices. This is what I've got so far...
The CurrentMonth variable currently holds the value 02. I keep receiving invalid syntax errors. My database is here:
Your syntax is invalid in SQLite. I think that you mean:
select * from stock_stats where strftime('%m', date) + 0 = ?
Rationale: strftime('%m', date) extracts the month part from the date column, and returns a string like '02'. You can just add 0 to force the conversion to a numeric value.
Note that:
you should also filter on the year part, to avoid mixing data from different years
a more efficient solution would be to pass 2 parameters, that define the start and end of the date range; this would avoid the need to use date functions on the date column.
date >= ? and date < ?
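A minimal, runnable sketch of the parameterized query (the stock_stats table name comes from the question; the price column and sample rows are assumptions, since the actual schema wasn't shown):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stock_stats (date TEXT, price REAL)")
conn.executemany(
    "INSERT INTO stock_stats VALUES (?, ?)",
    [("2021-02-01", 10.0), ("2021-02-15", 12.0), ("2021-03-01", 9.0)],
)

current_month = "02"
# strftime('%m', date) yields '02'; adding 0 coerces it to a number
rows = conn.execute(
    "SELECT * FROM stock_stats WHERE strftime('%m', date) + 0 = ?",
    (int(current_month),),
).fetchall()
monthly_avg = sum(price for _, price in rows) / len(rows)  # 11.0 for February
```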

Comparing timestamp to date in Python SQLAlchemy

I have an API in Python using sqlalchemy.
I have a string which represents a date in ISO format. I convert it using datetime.strptime like so: datetime.strptime(ToActionDateTime, '%Y-%m-%dZ').
Now I have to compare the value of a table's column which is a timestamp to that date.
After converting the initial ISO string, an example result looks like this: 2018-12-06 00:00:00. I have to compare it for equality based on the date, not the time, but I can't manage to get it right. Any help would be appreciated.
Sample Python code:
ToActionDateTimeObj = datetime.strptime(ToActionDateTime, '%Y-%m-%dZ')
query = query.filter(db.c.Audit.ActionDateTime <= ToActionDateTimeObj)
Edit:
I have also tried to implement cast to both parts of the equation but it does not work either. I can't manage to get the right result when the selected date matches the date of the timestamp.
from sqlalchemy import Date, cast
ToActionDateTimeObj = datetime.strptime(ToActionDateTime, '%Y-%m-%dZ')
query = query.filter(cast(db.c.Audit.ActionDateTime, Date) <= cast(ToActionDateTimeObj, Date))
Since the Oracle DATE datatype actually stores both date and time, a cast to DATE will not rid the value of its time portion, as it would in most other DBMSs. Instead, the function TRUNC(date [, fmt]) can be used to reduce a value to its date portion only. In its single-argument form it truncates to the nearest day; in other words, it uses 'DD' as the default model:
ToActionDateObj = datetime.strptime(ToActionDateTime, '%Y-%m-%dZ').date()
...
query = query.filter(func.trunc(db.c.Audit.ActionDateTime) <= ToActionDateObj)
If using the 2-argument form, then the precision specifier for day precision is either 'DDD', 'DD', or 'J'.
But this solution hides the column ActionDateTime from possible indexes. To make the query index-friendly, increment the date ToActionDateObj by one day and use a less-than comparison:
ToActionDateObj = datetime.strptime(ToActionDateTime, '%Y-%m-%dZ').date()
ToActionDateObj += timedelta(days=1)
...
query = query.filter(db.c.Audit.ActionDateTime < ToActionDateObj)
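The increment-and-compare trick is plain date arithmetic and easy to isolate; a sketch of just the conversion step (the '%Y-%m-%dZ' format string comes from the question):

```python
from datetime import datetime, timedelta

def day_upper_bound(iso_z):
    """Parse 'YYYY-MM-DDZ' and return the exclusive upper bound for
    'timestamp falls on this date or earlier' comparisons."""
    d = datetime.strptime(iso_z, "%Y-%m-%dZ").date()
    return d + timedelta(days=1)

# any timestamp strictly before 2018-12-07 falls on 2018-12-06 or earlier
bound = day_upper_bound("2018-12-06Z")
```

The returned date is then used with `<` in the filter, so the ActionDateTime column stays untransformed and indexable.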

postgres how to materialize a row for each value in an enum array

I have a postgres table that stores recurring appointments called appointments. I also have an enum type called day_of_week.
day_of_week is an enum that looks like:
select enum_range(null::ck_day_of_week);
returns
{Monday,Tuesday,Thursday,Sunday,Saturday,Friday,Wednesday}
The appointments table's DDL is:
create table appointments
(
    id serial primary key not null,
    title varchar not null,
    recurrence_start_date timestamptz not null,
    recurrence_end_date timestamptz,
    recurrence_days_of_week _ck_day_of_week -- this is actually an array of enums, not just an enum. I don't fully understand why the DDL looks like this
);
A query of
select id, title, recurrence_days_of_week from appointments where recurrence_start_date <= now() and (recurrence_end_date is null or recurrence_end_date >= now());
returns something like this:
1,My Recurring Appointment,{Wednesday,Thursday,Friday,Saturday,Sunday}
My question is, I need to 'materialize' this result as a set of rows for my UI view. Specifically I'd like to query the DB and have it return rows like:
id,My Recurring Appointment,date
where a unique key would likely be the appointment id plus the specific date, and where date is essentially recurrence_start_date + a day-of-week offset (between 0 and 6).
In Postgres, how do I select from a table that has an enum-array column and have it return a --virtual-- row of the --real-- same row for each enum value in the array? I'm thinking of a view or something, but I'm not sure how to write the select statement.
Bonus points if this can be translated into SQLAlchemy....
I'm also open and have considered just creating an appointments table that stores each appointment with its specific date. But if I do this...in theory, for appointments with recurring dates where there is no end date, the table could be infinitely large, so I've been trying to avoid that.
First, I'd suggest never using id as a column name.
Second, I'd use the built-in date/time functions to achieve the first objective: more specifically, the extract function with the dow parameter (day of week), combined with the generate_series function.
Make sure to check the fiddle in the References section.
Query returning list of appointments for each day of week:
create or replace function fnc_appointment_list_dow_by_id(_app_id int)
returns table (__app_id int, __title varchar, __recurrence_start_date timestamptz, __recurrence_appointment_day timestamptz, __reccurence_dow_offset int) as
$body$
select
app.app_id,
app.app_title,
app.app_recurrence_start_date,
_appointment_days._day,
extract(dow from _day)::int
from
appointments app,
generate_series(app.app_recurrence_start_date, app.app_recurrence_end_date, interval '1 day') _appointment_days(_day)
where app.app_id=_app_id;
$body$
language sql stable; -- consider changing function volatility as needed.
-- select * from fnc_appointment_list_dow_by_id(1);
Query returning array of appointment days of week:
create or replace function fnc_appointment_array_dow_by_id(_app_id int)
returns table (__app_id int, __title varchar, __recurrence_start_date timestamptz, __recurrence_appointment_day timestamptz[], __reccurence_dow_offset int[]) as
$body$
select
app.app_id,
app.app_title,
app.app_recurrence_start_date,
array_agg(_day::timestamptz),
array_agg(extract(dow from _day)::int)
from
appointments app,
generate_series(app.app_recurrence_start_date, app.app_recurrence_end_date, interval '1 day') as _appointment_days(_day)
where app.app_id=_app_id
group by app.app_id, app.app_title, app.app_recurrence_start_date;
$body$
language sql stable; -- consider changing function volatility as needed.
-- select * from fnc_appointment_array_dow_by_id(1);
If either of the functions is what you are looking for, let me know and we can move on to the next objectives.
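If the expansion ends up on the application side instead (e.g. in SQLAlchemy), the same generate-and-filter idea is straightforward in Python. A sketch using Postgres's dow convention (Sunday = 0), with plain dates standing in for the table rows:

```python
from datetime import date, timedelta

# index matches Postgres extract(dow from ...): Sunday = 0
DOW = ["Sunday", "Monday", "Tuesday", "Wednesday",
       "Thursday", "Friday", "Saturday"]

def materialize(start, end, days_of_week):
    """Yield each date in [start, end] whose weekday name is in days_of_week."""
    wanted = {DOW.index(d) for d in days_of_week}
    cur = start
    while cur <= end:
        # Python's weekday() has Monday = 0; shift to Postgres's Sunday = 0
        if (cur.weekday() + 1) % 7 in wanted:
            yield cur
        cur += timedelta(days=1)

# one recurring appointment on Wednesdays and Fridays, for one week
dates = list(materialize(date(2021, 11, 1), date(2021, 11, 7),
                         ["Wednesday", "Friday"]))
```

This mirrors what generate_series plus extract(dow ...) does server-side, filtered against the recurrence_days_of_week array.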
References:
Documentation on date/time functions (look for the extract function)
Documentation on set-returning functions (look for generate_series with arguments of timestamp type)
Fiddle (don't mind the array of timestamptz being displayed as an array of ints)
