SQL Counter Loop - python

Coming from Python/R but new to SQL...
Is there a way to use 'counter' in a SQL loop?
For example, how can I mimic the following simple Python for-loop in SQL:
counter = 7
for i in range(1, counter + 1):
    some_function_that_takes_i_as_argument(i)
Essentially, I am trying to substitute a numeric value "i" (stepping through a range) into a SQL query.

With SQL you need to develop a different way of thinking: not procedural, but in terms of what set you are looking to get. From your statement "SELECT * FROM tableA WHERE date > SYSDATE - i", it seems you just want data from the last 8 days (today and the 7 days prior). With SQL this is just a single statement (no loop required):
select *
from table_a
where date >= trunc(sysdate - 7)
order by date desc;
NOTE: The above uses "date" as a column name. This is an extremely poor choice, as DATE is an Oracle data type as well as an Oracle and SQL-standard reserved word.
The query uses the trunc function since the Oracle date data type always includes a time component.
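If the cutoff genuinely needs to vary, bind it as a parameter from Python rather than looping. A minimal sketch, assuming the python-oracledb driver, placeholder credentials, and a column renamed to date_col (to avoid the reserved word):
import oracledb  # the maintained successor to cx_Oracle

conn = oracledb.connect(user="scott", password="tiger", dsn="localhost/orclpdb")
cur = conn.cursor()
# :days is a bind variable; the driver passes the value to the database safely
cur.execute(
    "select * from table_a where date_col >= trunc(sysdate - :days) order by date_col desc",
    days=7,
)
rows = cur.fetchall()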

Related

Python - Filtering SQL query based on dates

I am trying to build a SQL query that will filter based on system date (Query for all sales done in the last 7 days):
import datetime
import pandas as pd
import psycopg2 as p
con = p.connect(db_details)
cur = con.cursor()
df = pd.read_sql("""select store_name,count(*) from sales
where created_at between datetime.datetime.now() - (datetime.today() - timedelta(7))""",con=con)
I get an error
psycopg2.NotSupportedError: cross-database references are not implemented: datetime.datetime.now
You are mixing Python syntax into your SQL query. SQL is parsed and executed by the database, not by Python, and the database knows nothing about datetime.datetime.now(), datetime.date(), or timedelta(). The specific error you see is caused by your Python code being interpreted as SQL: to the database, datetime.datetime.now references the now column of the datetime table in the datetime database. That is a cross-database reference, and psycopg2 doesn't support queries that involve multiple databases.
Instead, use SQL parameters to pass in values from Python to the database. Use placeholders in the SQL to show the database driver where the values should go:
params = {
# all rows after this timestamp, 7 days ago relative to 'now'
'earliest': datetime.datetime.now() - datetime.timedelta(days=7),
# if you must have a date *only* (no time component), use
# 'earliest': datetime.date.today() - datetime.timedelta(days=7),
}
df = pd.read_sql("""
select store_name,count(*) from sales
where created_at >= %(earliest)s""", params=params, con=con)
This uses placeholders as defined by the psycopg2 parameters documentation, where %(earliest)s refers to the earliest key in the params dictionary. datetime.datetime instances are directly supported by the driver.
Note that I also fixed your 7-days-ago expression, and replaced your BETWEEN syntax with >=; without a second date you are not querying for values between two dates, so use >= to limit the column to dates at or after the given one.
datetime.datetime.now() is not valid SQL syntax, and thus cannot be executed by read_sql(). I suggest either using the SQL expression that computes the current time, or creating variables for datetime.datetime.now() and datetime.today() - timedelta(7) and substituting them into your string.
edit: Do not follow the second suggestion. See comments below by Martijn Pieters.
Maybe you should remove that Python code inside your SQL, compute your dates in Python, and then use the strftime function to convert them to strings.
Then you'll be able to use them in your SQL query.
Actually, you do not necessarily need any params or computations in Python. Just use the corresponding SQL statement which should look like this:
select store_name,count(*)
from sales
where created_at >= now()::date - 7
group by store_name
Edit: I also added a group by which I think is missing.
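For completeness, running that statement from pandas then needs no parameters at all. A minimal sketch, assuming the con connection from the question:
df = pd.read_sql("""
    select store_name, count(*)
    from sales
    where created_at >= now()::date - 7
    group by store_name""", con=con)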

How to retrieve only the year from timestamp column?

I have the following query that runs correctly on Postgres 9.3:
select distinct date_part('year', date_created)
from "Topic";
The intention is to return only the distinct years on the column date_created which is created thus:
date_created | timestamp with time zone | not null default now()
I need to turn it into a SQLAlchemy query but what I wrote does a select distinct on the date_created, not on the year, and returns the whole row, not just the distinct value:
topics = Topic.query.distinct(func.date_part('YEAR', Topic.date_created)).all()
How can I get only the distinct years from the table Topic?
Here are two variants:
Using ORM:
from sqlalchemy import func, distinct
result = session.query(distinct(func.date_part('YEAR', Topic.date_created)))
for row in result:
print(row[0])
SQL Expression:
from sqlalchemy import func, select, distinct
query = select([distinct(func.date_part('YEAR', Topic.date_created))])
for row in session.execute(query):
print(row[0])
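Either variant yields one-element rows; to flatten them into a plain list of years, a small follow-up (assuming the ORM variant above):
years = sorted(int(year) for (year,) in result)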
SQL Alchemy syntax aside, you have a potential problem in your query.
Your data type is timestamptz (timestamp with time zone), which is a good choice. However, you cannot tell the year reliably from a timestamptz alone; you need to specify the time zone as well. If you don't, the current time zone setting of the session is applied silently, which may or may not work for you.
Think of New Year's Eve: timestamptz '2016-01-01 04:00:00+00' - what year is it?
It's 2016 in Europe, but still 2015 in the USA.
You should make that explicit with the AT TIME ZONE construct to avoid sneaky mistakes:
SELECT extract(year FROM timestamptz '2016-01-01 04:00:00+00'
AT TIME ZONE 'America/New_York') AS year;
Detailed explanation:
Ignoring timezones altogether in Rails and PostgreSQL
date_part() and extract() do the same in Postgres; extract() is the SQL standard, so rather use that.
SQL Fiddle.
BTW, you could also just:
SELECT extract(year FROM date_created) AS year
FROM "Topic"
GROUP BY 1;
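If you want the time-zone-safe version from SQLAlchemy as well, the AT TIME ZONE construct can be spelled with the generic .op() escape hatch. A sketch, not tested, assuming the Topic model from the question:
from sqlalchemy import distinct, extract

result = session.query(
    distinct(extract('year', Topic.date_created.op('AT TIME ZONE')('America/New_York')))
)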
Use the extract function:
from sqlalchemy import extract
session.query(extract('year', Topic.date_created))
(Concept code, not tested.)

Get All of Single Column from Every Table in Schema

In our system, we have 1000+ tables, each of which has a 'date' column containing a DateTime object. I want to get a list containing every date that exists across all of the tables. I'm sure there should be an easy way to do this, but I have very limited knowledge of either postgresql or sqlalchemy.
In postgresql, I can do a full join on two tables, but there doesn't seem to be a way to do a join on every table in a schema, for a single common field.
I then tried to solve this programmatically in Python with sqlalchemy. For each table, I created a select distinct for the 'date' column, set that list of selects as the selects property of a CompoundSelect object, and executed it. As one might expect from an ugly brute-force query, it has been running for an hour or so now, and I am unsure whether it has silently broken somewhere and will never return.
Is there a clean and better way to do this?
You definitely want to do this on the server, not at the application level, due to the many round trips between application and server and likely duplication of data in intermediate results.
Since you need to process 1,000+ tables, you should use the system catalogs and dynamically query the tables. You need a function to do that efficiently:
CREATE FUNCTION get_all_dates() RETURNS SETOF date AS $$
DECLARE
tbl text;
BEGIN
FOR tbl IN SELECT quote_ident(schemaname) || '.' || quote_ident(tablename) FROM pg_tables WHERE schemaname = 'public' LOOP
RETURN QUERY EXECUTE 'SELECT DISTINCT date::date FROM ' || tbl;
END LOOP;
END; $$ LANGUAGE plpgsql;
This will process all the tables in the public schema; change as required. If the tables are in multiple schemas you need to insert your additional logic on where tables are stored, or you can make the schema name a parameter of the function and call the function multiple times and UNION the results.
Note that you may get duplicate dates from multiple tables. These duplicates you can weed out in the statement calling the function:
SELECT DISTINCT * FROM get_all_dates() ORDER BY 1;
The function builds its result set in memory, but if the total number of distinct dates across the 1,000+ tables is very large, the results will spill to disk. If you expect this to happen, you are probably better off creating a temporary table at the beginning of the function and inserting the dates into that temp table.
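Calling the function from Python is then a single round trip. A minimal sketch, assuming a psycopg2 connection con:
cur = con.cursor()
cur.execute("SELECT DISTINCT * FROM get_all_dates() ORDER BY 1")
all_dates = [row[0] for row in cur.fetchall()]  # list of datetime.date objects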
Ended up reverting to a previous solution of using SqlAlchemy to run the queries. This allowed me to parallelize things and run a little faster, since it really was a very large query.
I knew a few things about the dataset that helped with this query: I only wanted the distinct dates from each table, and the dates were the PK in my set. I ended up using the approach from this wiki page. The code sent in the query looked like the following:
WITH RECURSIVE t AS (
(SELECT date FROM schema.tablename ORDER BY date LIMIT 1)
UNION ALL SELECT (SELECT date FROM schema.tablename WHERE date > t.date ORDER BY date LIMIT 1)
FROM t WHERE t.date IS NOT NULL)
SELECT date FROM t WHERE date IS NOT NULL;
I pulled the results of that query into a list of all my dates if they weren't already in the list, then saved that for use later. It's possible that it takes just as long as running it all in the pgsql console, but it was easier for me to save locally than to have to query the temp table in the db.
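A sketch of that parallelized per-table approach, assuming SQLAlchemy with a placeholder database URL and table list:
from concurrent.futures import ThreadPoolExecutor
from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:pass@localhost/mydb", pool_size=8)
tables = ["schema.table_a", "schema.table_b"]  # placeholder names

def distinct_dates(table):
    # table names come from our own list, not user input, so concatenation is safe
    with engine.connect() as conn:
        return {row[0] for row in conn.execute(text("SELECT DISTINCT date FROM " + table))}

all_dates = set()
with ThreadPoolExecutor(max_workers=8) as pool:
    for dates in pool.map(distinct_dates, tables):
        all_dates |= dates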

Wildcards in column name for MySQL

I am trying to select multiple columns, but not all of the columns, from the database. All of the columns I want to select are going to start with "word".
So in pseudocode I'd like to do this:
SELECT "word%" from searchterms where onstate = 1;
More or less. I am not finding any documentation on how to do this - is it possible in MySQL? Basically, I am trying to store a list of words in a single row, with an identifier, and I want to associate all of the words with that identifier when I pull the records. All of the words are going to be joined as a string and passed to another function in an array/dictionary with their identifier.
I am trying to make as FEW database calls as possible to keep the code speedy.
Ok, here's another question for you guys:
There are going to be a variable number of columns with the name "word" in them. Would it be faster to do a separate database call for each row, with a generated Python query per row, or would it be faster to simply SELECT *, and only use the columns I needed? Is it possible to say SELECT * NOT XYZ?
No, SQL doesn't provide you with any syntax to do such a select.
What you can do is ask MySQL for a list of column names first, then generate the SQL query from that information.
SELECT column_name
FROM information_schema.columns
WHERE table_name = 'your_table'
AND column_name LIKE 'word%'
lets you select the column names. Then you can build the query in Python:
"SELECT " + ', '.join(columns) + " FROM your_table WHERE onstate = 1"
Rather than concatenating strings by hand, I would recommend using SQLAlchemy to do the SQL generation for you.
However, if all you are doing is limiting the number of columns, there is no need for a dynamic query like this at all. The hard work for the database is selecting the rows; it makes little difference whether it sends you 5 columns out of 10, or all 10.
In that case just use a "SELECT * FROM ..." and use Python to pick out the columns from the result set.
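A sketch of that column-picking approach, assuming a DB-API cursor cur (e.g. from MySQLdb or PyMySQL):
cur.execute("SELECT * FROM searchterms WHERE onstate = 1")
# cursor.description holds one tuple per column; item 0 is the column name
names = [d[0] for d in cur.description]
word_idx = [i for i, name in enumerate(names) if name.startswith('word')]
for row in cur.fetchall():
    words = [row[i] for i in word_idx]  # only the word* columns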
No, you cannot dynamically produce the list of columns to be selected. It will have to be hardcoded in your final query.
Your current query would produce a result set with one column and the value of that column would be the string "word%" in all rows that satisfy the condition.
You can generate the list of column names first by using
SHOW COLUMNS IN tblname LIKE "word%"
Then loop through the cursor and generate a SQL statement that uses all the columns from the query above:
"SELECT {0} FROM searchterms WHERE onstate = 1".format(', '.join(columns))
This could be helpful: MySQL wildcard in select
In conclusion, it is not possible in MySQL directly.
What you could do as a workaround is get all the column names from the table with an initial query (http://dev.mysql.com/doc/refman/5.0/en/show-columns.html) and then check in Python which names match your pattern. Afterwards you can run the MySQL select statement with the matching column names, like this:
SELECT word1, word2, word3 from searchterms where onstate = 1;
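Putting the two steps together in Python, a sketch assuming a DB-API connection con to the MySQL database:
cur = con.cursor()
cur.execute("SHOW COLUMNS IN searchterms LIKE 'word%'")
columns = [row[0] for row in cur.fetchall()]  # first field is the column name
# the names come from the catalog, not user input, so formatting them in is safe
query = "SELECT {0} FROM searchterms WHERE onstate = 1".format(', '.join(columns))
cur.execute(query)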

How do I GROUP BY on every given increment of a field value?

I have a Python application. It has an SQLite database, full of data about things that happen, retrieved by a Web scraper from the Web. This data includes time-date groups, as Unix timestamps, in a column reserved for them. I want to retrieve the names of organisations that did things and count how often they did them, but to do this for each week (i.e. 604,800 seconds) I have data for.
Pseudocode:
for each 604800-second increment in time:
select count(time), org from table group by org
Essentially what I'm trying to do is iterate through the database like a list sorted on the time column, with a step value of 604800. The aim is to analyse how the distribution of different organisations in the total changed over time.
If at all possible, I'd like to avoid pulling all the rows from the db and processing them in Python as this seems a) inefficient and b) probably pointless given that the data is in a database.
Not being familiar with SQLite, I think this approach should work for most databases, as it finds the week number and subtracts the offset:
SELECT org, ROUND(time/604800) - week_offset, COUNT(*)
FROM table
GROUP BY org, ROUND(time/604800) - week_offset
In Oracle I would use the following if time was a date column:
SELECT org, TO_CHAR(time, 'YYYY-IW'), COUNT(*)
FROM table
GROUP BY org, TO_CHAR(time, 'YYYY-IW')
SQLite probably has similar functionality that allows this kind of SELECT which is easier on the eye.
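It does: SQLite's strftime() accepts a Unix timestamp directly when given the 'unixepoch' modifier. A sketch using Python's sqlite3 module, assuming a table named events with org and time columns:
import sqlite3

con = sqlite3.connect("scraper.db")  # placeholder filename
for org, week, n in con.execute(
        """SELECT org, strftime('%Y-%W', time, 'unixepoch') AS week, COUNT(*)
           FROM events GROUP BY org, week ORDER BY week, org"""):
    print(week, org, n)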
Create a table listing all weeks since the epoch, and JOIN it to your table of events.
CREATE TABLE Weeks (
week INTEGER PRIMARY KEY
);
INSERT INTO Weeks (week) VALUES (200919); -- e.g. this week
SELECT w.week, e.org, COUNT(*)
FROM Events e JOIN Weeks w ON (w.week = strftime('%Y%W', e.time, 'unixepoch'))
GROUP BY w.week, e.org;
There are only 52-53 weeks per year. Even if you populate the Weeks table for 100 years, that's still a small table.
To do this in a set-based manner (which is what SQL is good at) you will need a set-based representation of your time increments. That can be a temporary table, a permanent table, or a derived table (i.e. a subquery). I'm not too familiar with SQLite and it's been a while since I've worked with UNIX. Timestamps in UNIX are just the number of seconds since some fixed date/time? Using a standard Calendar table (which is useful to have in a database)...
SELECT
C1.start_time,
C2.end_time,
T.org,
COUNT(time)
FROM
Calendar C1
INNER JOIN Calendar C2 ON
C2.start_time = DATEADD(dy, 6, C1.start_time)
INNER JOIN My_Table T ON
T.time BETWEEN C1.start_time AND C2.end_time -- You'll need to convert to timestamp here
WHERE
DATEPART(dw, C1.start_time) = 1 AND -- Basically, only get dates that are a Sunday or whatever other day starts your intervals
C1.start_time BETWEEN #start_range_date AND #end_range_date -- Period for which you're running the report
GROUP BY
C1.start_time,
C2.end_time,
T.org
The Calendar table can take whatever form you want, so you could use UNIX timestamps in it for the start_time and end_time. You just pre-populate it with all of the dates in any conceivable range that you might want to use. Even going from 1900-01-01 to 9999-12-31 won't be a terribly large table. It can come in handy for a lot of reporting type queries.
Finally, this code is T-SQL, so you'll probably need to convert the DATEPART and DATEADD to whatever the equivalent is in SQLite.
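For reference, rough SQLite equivalents of those two T-SQL calls (a sketch, assuming ISO date strings in the calendar table):
import sqlite3

con = sqlite3.connect(":memory:")
# DATEADD(dy, 6, d)    ->  date(d, '+6 days')
print(con.execute("SELECT date('2009-05-03', '+6 days')").fetchone()[0])  # 2009-05-09
# DATEPART(dw, d) = 1  ->  strftime('%w', d) = '0' (Sunday)
print(con.execute("SELECT strftime('%w', '2009-05-03')").fetchone()[0])   # 0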
