Trouble with SQL join, where, having clause - python

I'm having trouble understanding how to make a query that will show me 'the three most popular articles' in terms of views ('Status: 200 OK').
There are 2 tables I'm currently dealing with.
A Log table
An Articles table
The columns in these tables:
Table "public.log"
Column | Type | Modifiers
--------+--------------------------+--------------------------------------------------
path | text |
ip | inet |
method | text |
status | text |
time | timestamp with time zone | default now()
id | integer | not null default nextval('log_id_seq'::regclass)
Indexes:
and
Table "public.articles"
Column | Type | Modifiers
--------+--------------------------+-------------------------------------------------------
author | integer | not null
title | text | not null
slug | text | not null
lead | text |
body | text |
time | timestamp with time zone | default now()
id | integer | not null default nextval('articles_id_seq'::regclass)
Indexes:
So far, I've written this query based on my level and current understanding of SQL...
SELECT articles.title, log.status
FROM articles join log
WHERE articles.title = log.path
HAVING status = “200 OK”
GROUP BY title, status
Obviously, this is incorrect. I want to be able to pull the three most popular articles from the database, and I know that matching the '200 OK' entries against the article title counts one "view" or hit for that article. My thought process is: I need a query that determines how many times article.title = log.path shows up in the log table (with a status of '200 OK'). My assignment is actually to write a program that prints the results, with "[my code getting] the database to do the heavy lifting by using joins, aggregations, and the where clause... doing minimal 'post-processing' in the Python code itself."
Any explanation, idea, or tip is appreciated, StackOverflow...

Perhaps the following is what you have in mind:
SELECT
    a.title,
    COUNT(*) AS cnt
FROM articles a
INNER JOIN log l
    ON a.title = l.path
WHERE
    l.status = '200 OK'
GROUP BY
    a.title
ORDER BY
    COUNT(*) DESC
LIMIT 3;
This would return the three article titles having the highest counts of status 200 hits. The schema output you posted looks like PostgreSQL, and the query should work there as written; it also works in MySQL, since both support LIMIT.
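Since the assignment is to print the result from Python with the database doing the heavy lifting, here is a minimal sketch of how the query above might be run. It assumes PostgreSQL with the psycopg2 driver and a database named "news" (both assumptions; adjust to your setup):

import psycopg2

QUERY = """
    SELECT a.title, COUNT(*) AS cnt
    FROM articles a
    JOIN log l ON a.title = l.path
    WHERE l.status = '200 OK'
    GROUP BY a.title
    ORDER BY cnt DESC
    LIMIT 3;
"""

# Connect, run the query, and print one line per article.
conn = psycopg2.connect(dbname="news")   # "news" is an assumed database name
cur = conn.cursor()
cur.execute(QUERY)
for title, views in cur.fetchall():
    print("{} -- {} views".format(title, views))
cur.close()
conn.close()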

Related

How to compare two tables and identify a specific type of row to be returned?

I have two tables with relatively different data.
the photos table is a table with all the relevant meta data for photos such as user_id, photo_id, datetime, name, etc.
I have another table ratings that holds liked/disliked data for each respective photo. The columns in this table would have rater_id(for the person rating the picture), photo_id, and the rating (like/dislike).
The user would be presented a picture (at random) and then pick whether they liked it or not. Every time the image is loaded/presented it would have to be something that they have not yet rated.
What I'm trying to do is return a photo_id where the user has not yet rated it.
I've thought of using join or union, but I'm having difficulty understanding how to best use those (or any other solution) for this application. Where my confusion lies is how I can compare the ratings table against the photos table, to only return the photos that have not been rated by rater_id.
Sample data
photos table
id | photo_id
-------------------------
1 | photo_123
2 | photo_456
3 | photo_432
4 | photo_642
-------------------------
ratings table
id | photo_id | rater_id | rating
---------------------------------
1 | photo_123 | user2 | 1
2 | photo_456 | user2 | 1
3 | photo_123 | user1 | 1
4 | photo_642 | user2 | 1
--------------------------------
Sample Result: return photo_432 for user2 because it has not yet had a rating in ratings table
The canonical way would be not exists:
select p.*
from photos p
where not exists (select 1
                  from ratings r
                  where r.photo_id = p.photo_id and
                        r.rater_id = #rater
                 )
order by rand()
limit 1;
There are more efficient ways to get a random row back if the table is big.
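If you are driving this from Python, here is a self-contained sketch of the same NOT EXISTS pattern. It uses the sqlite3 module purely for illustration (the question does not name an engine, and SQLite spells the random function random() rather than rand()), and it passes the rater id as a bound parameter instead of the #rater placeholder:

import sqlite3

# Build the sample tables from the question in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE photos  (id INTEGER PRIMARY KEY, photo_id TEXT);
    CREATE TABLE ratings (id INTEGER PRIMARY KEY, photo_id TEXT,
                          rater_id TEXT, rating INTEGER);
    INSERT INTO photos (photo_id) VALUES
        ('photo_123'), ('photo_456'), ('photo_432'), ('photo_642');
    INSERT INTO ratings (photo_id, rater_id, rating) VALUES
        ('photo_123', 'user2', 1), ('photo_456', 'user2', 1),
        ('photo_123', 'user1', 1), ('photo_642', 'user2', 1);
""")

# Pick one random photo the given rater has not rated yet.
row = conn.execute("""
    SELECT p.photo_id
    FROM photos p
    WHERE NOT EXISTS (SELECT 1
                      FROM ratings r
                      WHERE r.photo_id = p.photo_id
                        AND r.rater_id = ?)
    ORDER BY random()
    LIMIT 1;
""", ("user2",)).fetchone()

print(row)   # ('photo_432',) -- the only photo user2 has not rated
conn.close()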

Get the intersection of two many-to-many relationship of specific values

N.B. I have tagged this with SQLAlchemy and Python because the whole point of the question was to develop a query to translate into SQLAlchemy. This is clear in the answer I have posted. It is also applicable to MySQL.
I have three interlinked tables I use to describe a book. (In the below table descriptions I have eliminated extraneous rows to the question at hand.)
MariaDB [icc]> describe edition;
+-----------+------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-----------+------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
+-----------+------------+------+-----+---------+----------------+
7 rows in set (0.001 sec)
MariaDB [icc]> describe line;
+------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+------------+--------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| edition_id | int(11) | YES | MUL | NULL | |
| line | varchar(200) | YES | | NULL | |
+------------+--------------+------+-----+---------+----------------+
5 rows in set (0.001 sec)
MariaDB [icc]> describe line_attribute;
+------------+------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+------------+------+-----+---------+-------+
| line_id | int(11) | NO | PRI | NULL | |
| num | int(11) | YES | | NULL | |
| precedence | int(11) | YES | MUL | NULL | |
| primary | tinyint(1) | NO | MUL | NULL | |
+------------+------------+------+-----+---------+-------+
5 rows in set (0.001 sec)
line_attribute.precedence is the hierarchical level of the given heading. So if War and Peace has Books > Chapters, all of the lines have an attribute that corresponds to the Book they're in (e.g., Book 1 has precedence=1 and num=1) and an attribute for the Chapter they're in (e.g., Chapter 2 has precedence=2 and num=2). This allows me to translate the hierarchical structure of books with volumes, books, parts, chapters, sections, or even acts and scenes. The primary column is a boolean, so that each and every line has one attribute that is primary. If it is a book heading, it is the Book attribute, if it is a chapter heading, it is the Chapter attribute. If it is a regular line in text, it is a line attribute, and the precedence is 0 since it is not a part of the hierarchical structure.
I need to be able to query for all lines with a particular edition_id and that also have the intersection of two line_attributes.
(This would allow me to get all lines from a particular edition that are in, say, Book 1 Chapter 2 of War and Peace).
I can get all lines that have Book 1 with
SELECT
    line.*
FROM
    line
INNER JOIN
    line_attribute
ON
    line_attribute.line_id = line.id
WHERE
    line.edition_id = 2 AND line_attribute.precedence = 1 AND line_attribute.num = 1;
and I can get all lines that have Chapter 2:
SELECT
    line.*
FROM
    line
INNER JOIN
    line_attribute
ON
    line_attribute.line_id = line.id
WHERE
    line.edition_id = 2 AND line_attribute.precedence = 2 AND line_attribute.num = 2;
Except the second query returns each chapter 2 from every book in War and Peace.
How do I get from these two queries to just the lines from book 1 chapter 2?
Warning from Raymond Nijland in the comments:
Note for future readers: because this question is tagged MySQL, be aware that MySQL does not support the INTERSECT keyword. MariaDB is indeed a fork of the MySQL source code, but it supports extra features that MySQL does not. In MySQL you can simulate the INTERSECT keyword with an INNER JOIN or IN().
Trying to write up a question on SO helps me get my thoughts clear and eventually solve the problem before I have to ask the question. The queries above are much clearer than my initial queries and the question pretty much answers itself, but I never found a clear answer that talks about the intersect utility, so I'm posting this answer anyway.
The solution was the INTERSECT operator: the answer is simply the intersection of those two queries:
SELECT
    line.*
FROM
    line
INNER JOIN
    line_attribute
ON
    line_attribute.line_id = line.id
WHERE
    line.edition_id = 2 AND line_attribute.precedence = 1 AND line_attribute.num = 1
INTERSECT /* it is literally this simple */
SELECT
    line.*
FROM
    line
INNER JOIN
    line_attribute
ON
    line_attribute.line_id = line.id
WHERE
    line.edition_id = 2 AND line_attribute.precedence = 2 AND line_attribute.num = 2;
This also means I could get all of the book and chapter headings for a particular book by simply adding an additional constraint (line_attribute.primary=1).
This solution seems broadly applicable to me. If, for instance, you have tagged questions in a StackOverflow clone, you can get the intersection of questions with two tags (e.g., all posts that have both the SQLAlchemy and Python tags). I am certainly going to use this method for that sort of query.
I worked this out in raw SQL (on MariaDB) first because it helps me get the query straight before translating it into SQLAlchemy.
The SQLAlchemy query is this simple:
In [10]: q1 = Line.query.join(LineAttribute).filter(LineAttribute.precedence==1, LineAttribute.num==1)
In [11]: q2 = Line.query.join(LineAttribute).filter(LineAttribute.precedence==2, LineAttribute.num==2)
In [12]: q1.intersect(q2).all()
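For readers on MySQL proper, where INTERSECT is not available (see the warning above), the same result can be obtained by joining line_attribute twice, once per constraint. A hedged SQLAlchemy sketch, assuming the Line and LineAttribute models implied by the tables in the question:

from sqlalchemy.orm import aliased

# Two aliases of the same attribute table: one row must say "Book 1",
# another row for the same line must say "Chapter 2".
book = aliased(LineAttribute)
chapter = aliased(LineAttribute)

lines = (Line.query
         .join(book, book.line_id == Line.id)
         .join(chapter, chapter.line_id == Line.id)
         .filter(Line.edition_id == 2,
                 book.precedence == 1, book.num == 1,
                 chapter.precedence == 2, chapter.num == 2)
         .all())

The generated SQL is two INNER JOINs whose conditions are ANDed together, which is the simulation the comment above describes.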
Hopefully the database structure in this question helps someone down the road. I didn't want to delete the question after I solved my own problem.

Web2py DAL find the record with the latest date

Hi I have a table with the following structure.
Table Name: DOCUMENTS
Sample Table Structure:
ID | UIN | COMPANY_ID | DOCUMENT_NAME | MODIFIED_ON |
---|----------|------------|---------------|---------------------|
1 | UIN_TX_1 | 1 | txn_summary | 2016-09-02 16:02:42 |
2 | UIN_TX_2 | 1 | txn_summary | 2016-09-02 16:16:56 |
3 | UIN_AD_3 | 2 | some other doc| 2016-09-02 17:15:43 |
I want to fetch the latest modified record UIN for the company whose id is 1 and document_name is "txn_summary".
This is the postgresql query that works:
select distinct on (company_id)
    uin
from documents
where company_id = 1
    and document_name = 'txn_summary'
order by company_id, "modified_on" DESC;
This query fetches me UIN_TX_2 which is correct.
I am using the web2py DAL to get this value. After some research I have managed to do this:
fmax = db.documents.modified_on.max()
query = (db.documents.company_id==1) & (db.documents.document_name=='txn_summary')
rows = db(query).select(fmax)
Now "rows" contains only the value of the modified_on date which has maximum value. I want to fetch the record which has the maximum date inside "rows". Please suggest a way. Help is much appreciated.
And my requirement extends to find each such records for each company_id for each document_name.
Your approach will not return the complete row; it will only return the last modified_on value.
To fetch the last-modified record for the company whose id is 1 and document_name is "txn_summary", the query will be:
query = (db.documents.company_id==1) & (db.documents.document_name=='txn_summary')
row = db(query).select(db.documents.ALL, orderby=~db.documents.modified_on, limitby=(0, 1)).first()
orderby=~db.documents.modified_on returns records in descending order of modified_on (the last-modified record first), limitby=(0, 1) restricts the result to a single row, and first() selects that record. In other words, the complete query returns the last-modified record having company_id 1 and document_name "txn_summary".
There may be other/better ways to achieve this. Hope this helps!
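For the extended requirement (the latest record for every company_id and document_name), here is a hedged sketch that loops over the distinct pairs and reuses the same orderby/limitby pattern; it is fine for modest table sizes but issues one query per pair:

# Collect the distinct (company_id, document_name) pairs, then fetch the
# last-modified row for each pair with the same orderby/limitby trick.
pairs = db().select(db.documents.company_id,
                    db.documents.document_name,
                    distinct=True)

latest = {}
for p in pairs:
    q = ((db.documents.company_id == p.company_id) &
         (db.documents.document_name == p.document_name))
    latest[(p.company_id, p.document_name)] = db(q).select(
        db.documents.ALL,
        orderby=~db.documents.modified_on,
        limitby=(0, 1)).first()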

SQLite: Transposing results of a GROUP BY and filling in IDs with names

My question is rather specific; if you have a better title, please suggest one. Also, the formatting is rough, as I didn't know how to combine lists and code blocks.
I have an SQLite3 database with the following (relevant parts of the) .schema:
CREATE TABLE users (id INTEGER PRIMARY KEY NOT NULL, user TEXT UNIQUE);
CREATE TABLE locations (id INTEGER PRIMARY KEY NOT NULL, name TEXT UNIQUE);
CREATE TABLE purchases (location_id INTEGER, user_id INTEGER);
CREATE TABLE sales (location_id integer, user_id INTEGER);
purchases has about 4.5mil entries, users about 300k, sales about 100k, and locations about 250 - just to gauge the data volume.
My desired use would be to generate a JSON object to be handed off to another application, very much condensed in volume by doing the following:
-GROUPing both purchases and sales into one common table BY location_id, user_id; in other words, getting the number of "actions" per user per location. That I can do; the result is something like
loc | usid | loccount
-----------------------
1 | 1246 | 123
1 | 2345 | 1
13 | 1246 | 46
13 | 8732 | 4
27 | 2345 | 41
(At least it looks good, always hard to tell with such volumes; query:
select location_id, user_id, count(location_id) from
    (select location_id, user_id from purchases
     union all
     select location_id, user_id from sales)
group by location_id, user_id order by user_id
)
-Then, transposing that giant table such that I would get:
usid | loc1 | loc13 | loc27
---------------------------
1246 | 123 | 46 | 0
2345 | 1 | 0 | 41
8732 | 0 | 4 | 0
That I cannot do, and it's my absolutely crucial point for this question. I tried some things I found online, especially here, but I just started SQLite a little while ago and don't understand many queries.
-Lastly, translate the table into plain text in order to write it to JSON:
user | AAAA | BBBBB | CCCCC
---------------------------
zeta | 123 | 46 | 0
beta | 1 | 0 | 41
iota | 0 | 4 | 0
That I probably could do with quite a bit of experimentation and an inner join, although I'm always unsure which approach is best for handling such data volumes, hence I wouldn't mind a pointer.
The whole thing is written in Python's sqlite3 interface, if it matters. In the end, I'd love to have something I could just do a "for" loop per user over in order to generate the JSON, which would then of course be very simple. It doesn't matter if the query takes a long time (<10min would be nice), it's only run twice per day as a sort of backup. I've only got a tiny VPS available, but being limited to a single core the performance is as good as on my reasonably powerful desktop. (i5-3570k running Debian.)
The table headers are just examples, because I wasn't quite sure if I could use integers for them (I didn't discover the syntax if so); as long as I'm somehow able to look up the numeric part in the locations table, I'm fine. The same goes for translating the user IDs into names. The number of columns is known beforehand: they're after all just INTEGER PRIMARY KEYs, and I have a list() of them from some other operation. The number of rows can be determined reasonably quickly, ~3s, if need be.
Consider using subqueries to achieve your desired transposed output:
SELECT DISTINCT m.usid,
       IFNULL((SELECT t1.loccount FROM tablename t1
               WHERE t1.usid = m.usid AND t1.loc = 1), 0) AS Loc1,
       IFNULL((SELECT t2.loccount FROM tablename t2
               WHERE t2.usid = m.usid AND t2.loc = 13), 0) AS Loc13,
       IFNULL((SELECT t3.loccount FROM tablename t3
               WHERE t3.usid = m.usid AND t3.loc = 27), 0) AS Loc27
FROM tablename AS m
Alternatively, you can use conditional aggregation (SQLite has no IF expression, so CASE/WHEN is used instead) over a derived table:
SELECT temp.usid,
       MAX(temp.Loc1) AS Loc1,
       MAX(temp.Loc13) AS Loc13,
       MAX(temp.Loc27) AS Loc27
FROM
    (SELECT tablename.usid,
            CASE WHEN loc = 1 THEN loccount ELSE 0 END AS Loc1,
            CASE WHEN loc = 13 THEN loccount ELSE 0 END AS Loc13,
            CASE WHEN loc = 27 THEN loccount ELSE 0 END AS Loc27
     FROM tablename) AS temp
GROUP BY temp.usid
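Since the whole thing is driven from Python's sqlite3 interface and the list of location ids is known beforehand, here is a hedged sketch of generating the conditional-aggregation pivot dynamically and joining users to get names; the database filename and the example location ids are assumptions:

import sqlite3

# The location ids you said you already have in a list(); example values only.
location_ids = [1, 13, 27]

# One SUM(CASE ...) column per location. The ids come from your own trusted
# list of integers, so string formatting here is safe from injection.
pivot_cols = ",\n           ".join(
    "SUM(CASE WHEN location_id = {0} THEN 1 ELSE 0 END) AS loc{0}".format(i)
    for i in location_ids
)

query = """
    SELECT u.user,
           {cols}
    FROM (SELECT location_id, user_id FROM purchases
          UNION ALL
          SELECT location_id, user_id FROM sales) AS t
    JOIN users u ON u.id = t.user_id
    GROUP BY u.user;
""".format(cols=pivot_cols)

conn = sqlite3.connect("mydata.db")   # assumed filename
for row in conn.execute(query):
    print(row)                        # e.g. ('zeta', 123, 46, 0)
conn.close()

The loc column headers can be translated to location names by looking up the numeric part in the locations table, the same way user names are joined here.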

Django group by id then select max timestamp

It might be a redundant question, but I have tried previous answers from other related topics and still can't figure it out.
I have a table Board_status that looks like this (multiple statuses and timestamps for each board):
time | board_id | status
-------------------------------
2012-4-5 | 1 | good
2013-6-6 | 1 | not good
2013-6-7 | 1 | alright
2012-6-8 | 2 | good
2012-6-4 | 3 | good
2012-6-10 | 2 | good
Now I want to select all records from the Board_status table, group them by board_id, and then select the latest status for each board. Basically, I want to end up with a table like this (only the latest status and timestamp for each board):
time | board_id | status
------------------------------
2013-6-7 | 1 | alright
2012-6-4 | 3 | good
2012-6-10 | 2 | good
I have tried:
b = Board_status.objects.values('board_id').annotate(max=Max('time')).values_list('board_id','max','status')
but it doesn't seem to work; it still gives me more than one record per board_id.
Which command should I use in Django to do this?
An update, this is the solution I use. Not the best, but it works for now:
b = []
a = Board_status.objects.values('board_id').distinct()
for i in range(a.count()):
    b.append(Board_status.objects.filter(board_id=a[i]['board_id']).latest('time'))
So I get all the board_ids into a, then for each board_id I run another query to get the latest record. Any better answer is still welcome.
How would that work? You have neither a filter nor a distinct clause to remove the duplicates, and I am not sure this can be done easily in a single Django query. You should read more on:
https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.distinct
https://docs.djangoproject.com/en/1.4/topics/db/aggregation/
If you can't do it in one raw SQL query, you can't do it with an O/R mapper either, as it is built on top of MySQL (in your case). Can you tell me how you would do this via raw SQL?
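For later readers, here is a hedged sketch of two patterns that commonly solve "latest row per group" in Django; neither is from the original answer, and the second works on PostgreSQL only:

from django.db.models import Max

# Pattern 1: portable. First find each board's latest timestamp, then fetch
# the matching rows (one extra query per board).
latest_per_board = (Board_status.objects
                    .values('board_id')
                    .annotate(latest=Max('time')))
rows = [
    Board_status.objects.filter(board_id=entry['board_id'],
                                time=entry['latest']).first()
    for entry in latest_per_board
]

# Pattern 2: PostgreSQL only; mirrors SELECT DISTINCT ON (board_id) ... and
# does the whole job in a single query.
rows = (Board_status.objects
        .order_by('board_id', '-time')
        .distinct('board_id'))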
