Nested Query in FROM and HAVING clauses - python

Schema:
CREATE TABLE companies (
company_name varchar(200),
market varchar(200),
funding_total integer,
status varchar(20),
country varchar(10),
state varchar(10),
city varchar(30),
funding_rounds integer,
founded_at date,
first_funding_at date,
last_funding_at date,
PRIMARY KEY (company_name,market,city)
);
Query:
Which state(s) have the largest number of startups in the "Security" market (i.e. the market column contains the word "Security")? List all ties.
Code:
db.executescript("""
DROP VIEW IF EXISTS q3;
select companies.state, count(*) as total
from companies
where companies.market like '%Security%'
group by companies.state
having count(*) =
(
select max(countGroup) as maxNumber
from (select C.state, count(*) as countGroup
      from companies as C
      where C.market like '%Security%'
      group by C.state)
);
""")
EDIT:
There is still an error because the output/result is empty. Any ideas why?

Try this. (Please adapt to the syntax of your RDBMS.)
select state, total from
( select companies.state, count(*) as total
  from companies
  where companies.market like '%Security%'
  group by companies.state
) as countgroups
where total =
(
select max(countGroup) as maxNumber
from (select C.state, count(*) as countGroup
      from companies as C
      where C.market like '%Security%'
      group by C.state) as grouped
);
Alternatively (note that this returns a single row, so ties are dropped):
select state, total from
(select companies.state, count(*) as total
 from companies
 where companies.market like '%Security%'
 group by companies.state
) as countgroups
order by 2 desc
limit 1; -- please adapt to the syntax of your RDBMS
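Also note: since the Python snippet runs everything through db.executescript, the rows of the final SELECT are never returned; executescript is meant for running a batch of statements, not for fetching results, which would explain the empty output. A minimal sketch of fetching the rows with sqlite3 (assuming db is a sqlite3.Connection and a hypothetical database file):
import sqlite3

db = sqlite3.connect("startups.db")  # hypothetical database file

sql = """
select state, total from
( select companies.state, count(*) as total
  from companies
  where companies.market like '%Security%'
  group by companies.state
) as countgroups
where total =
( select max(countGroup)
  from (select C.state, count(*) as countGroup
        from companies as C
        where C.market like '%Security%'
        group by C.state) as grouped
);
"""

# execute() + fetchall() returns the rows; executescript() would not.
for state, total in db.execute(sql).fetchall():
    print(state, total)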

Surround the subquery with parentheses

Django Delete duplicates rows and keep the last using SQL query

I need to execute a SQL query that deletes duplicated rows based on one column and keeps the last record. Note that it's a large table, so the Django ORM takes a very long time; I need a raw SQL query instead. The column name is customer_number and the table name is pages_dataupload. I'm using SQLite.
Update: I tried this, but it gives me no such column: row_num:
cursor = connection.cursor()
cursor.execute(
    '''WITH cte AS (
        SELECT
            id,
            customer_number,
            ROW_NUMBER() OVER (
                PARTITION BY
                    id,
                    customer_number
                ORDER BY
                    id,
                    customer_number
            ) row_num
        FROM
            pages_dataupload
    )
    DELETE FROM pages_dataupload
    WHERE row_num > 1;
    '''
)
You can work with an Exists subquery [Django-doc] to determine efficiently if there is a later DataUpload:
from django.db.models import Exists, OuterRef

DataUpload.objects.filter(Exists(
    DataUpload.objects.filter(
        pk__gt=OuterRef('pk'), customer_number=OuterRef('customer_number')
    )
)).delete()
This will thus check, for each DataUpload, whether there exists a DataUpload with a larger primary key and the same customer_number. If so, we remove that DataUpload.
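As for the original error: row_num only exists inside the CTE, not in pages_dataupload, so the DELETE cannot reference it directly. A hedged raw-SQL alternative (assuming SQLite 3.25+ for window functions) is to compute the ids to delete inside the CTE and filter on them; note the partition is by customer_number alone, since partitioning by id as well makes every row_num equal to 1:
cursor = connection.cursor()
cursor.execute(
    '''
    WITH cte AS (
        SELECT id,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_number
                   ORDER BY id DESC
               ) AS row_num
        FROM pages_dataupload
    )
    -- keep the row with the highest id per customer_number, delete the rest
    DELETE FROM pages_dataupload
    WHERE id IN (SELECT id FROM cte WHERE row_num > 1);
    '''
)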
I have solved the problem with the query below; is there any way to reset the id field after removing the duplicates?
cursor = connection.cursor()
cursor.execute(
    '''
    DELETE FROM pages_dataupload WHERE id NOT IN (
        SELECT MAX(id) FROM pages_dataupload GROUP BY Dial
    )
    '''
)
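A hedged note on resetting the id column, assuming Django's default SQLite AUTOINCREMENT primary key: you can clear the stored counter, but renumbering existing rows is usually not worth it (and would break anything referencing those ids):
# Hypothetical sketch: clear SQLite's stored AUTOINCREMENT counter for this
# table; existing ids are untouched and new ids continue past the current max.
cursor.execute("DELETE FROM sqlite_sequence WHERE name = 'pages_dataupload'")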

Hive: I have a QA dataset (ID, time, content, role) ordered by timestamp. How can I transpose it to a format like (ID, roleA, roleB)?

I want to output data like below:
ID roleA roleB
xxx is customer service? yes, how can i help you, how can i help you
xxx is customer service? yes
xxx great, why this happens wait a minute, let me check
I have no idea how to solve this using either SQL or Python.
This is a gap-and-islands problem: consecutive messages from the same role form an "island" that can be identified by the difference of two row numbers, followed by conditional aggregation:
select biz_id, send_role, min(create_time) as create_time,
       concat_ws(' ', collect_list(content)) as content
from (select t.*,
             row_number() over (partition by biz_id order by create_time) as seqnum,
             row_number() over (partition by biz_id, send_role order by create_time) as seqnum_2
      from t
     ) t
group by biz_id, send_role, (seqnum - seqnum_2);
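To see why (seqnum - seqnum_2) pins down each island, here is a hypothetical two-role conversation; the difference is constant across a run of consecutive messages from the same role and changes whenever the role switches:
create_time  send_role  seqnum  seqnum_2  seqnum - seqnum_2
t1           2          1       1         0
t2           2          2       2         0
t3           3          3       1         2
t4           2          4       3         1
Rows t1 and t2 (role 2, difference 0) form one island, t3 (role 3) another, and t4 (role 2 again, difference 1) a third.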
Then with this, you can reaggregate to get what you want:
with x as (
      select biz_id, send_role, min(create_time) as create_time,
             concat_ws(' ', collect_list(content)) as content
      from (select t.*,
                   row_number() over (partition by biz_id order by create_time) as seqnum,
                   row_number() over (partition by biz_id, send_role order by create_time) as seqnum_2
            from t
           ) t
      group by biz_id, send_role, (seqnum - seqnum_2)
)
select biz_id,
max(case when send_role = 2 then content end),
max(case when send_role = 3 then content end)
from (select x.*,
row_number() over (partition by biz_id, send_role order by create_time) as seqnum
from x
) x
group by biz_id, seqnum;
Note: This may put the content of "adjacent" rows in an arbitrary order. Getting these in the "right" order is tricky; in your sample data the date/times are identical, so there is no obvious ordering column. If the source table has a monotonically increasing message id, adding it as a tiebreaker in the order by clauses would make the ordering deterministic.

Get the most popular product in the same category

I have 2 tables in my Postgres database.
First table 'products': https://imgur.com/a/Ru0IcuY
CREATE TABLE if not exists PRODUCTS (product_id varchar PRIMARY KEY, product_name varchar, price int, gender varchar,
category varchar, sub_category varchar, sub_sub_category varchar);
And my second table 'pop_products': https://imgur.com/a/6U3zBro
This contains the product ids and the number of times they've been sold.
Note: the 'product_id' in 'pop_products' is not a foreign key
CREATE TABLE if not exists POP_PRODUCTS (product_id varchar PRIMARY KEY, freq int);
My goal is to find the most popular product with the same category.
My code up until now:
SELECT product_id FROM products
WHERE category LIKE '""" + category[0] + """'
AND product_id NOT LIKE CAST(""" + productid + """ AS varchar)
I've been sitting here scratching my head for the last 30 minutes, trying to figure out a solution but no bueno till now.
Hmmm . . . you can use distinct on after joining the tables together:
select distinct on (p.category) p.*, pp.freq
from pop_products pp
join products p using (product_id)
order by p.category, pp.freq desc;
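distinct on (p.category) keeps the first row per category under the order by, i.e. the highest-freq product in each category. Separately, the original query builds SQL by string concatenation, which is fragile and open to injection; a minimal parameterized sketch, assuming psycopg2 and the hypothetical category / productid variables from the question:
import psycopg2

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
cur = conn.cursor()

# %s placeholders let the driver handle quoting instead of concatenating
# user input into the SQL string.
cur.execute(
    """
    SELECT p.product_id
    FROM products p
    JOIN pop_products pp USING (product_id)
    WHERE p.category = %s
      AND p.product_id <> %s
    ORDER BY pp.freq DESC
    LIMIT 1
    """,
    (category[0], productid),
)
row = cur.fetchone()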

troubles with 'WHERE...IN' clause

I'm trying to run the following query through pandasql, but the output is not what I expected. I expected a table with exactly 800 rows, since I select only the employee_day_transmitters from the CTE employee_days_transmitters, but I get a table with more than 800 rows. What's wrong? How can I get exactly the 800 rows related to the employee_day_transmitters selected in employee_days_transmitters?
query_text = '''WITH employee_days_transmitters AS (
SELECT DISTINCT
employeeId
, theDate
, transmitterId
, employeeId || '-' || CAST(theDate AS STRING) || '-' || transmitterId AS employee_day_transmitter
FROM
table1
WHERE variable='rpv'
ORDER BY
RANDOM()
LIMIT
800
)
SELECT
*
FROM
table1
WHERE
(employeeId || '-' || CAST(theDate AS STRING) || '-' || transmitterId) IN (SELECT employee_day_transmitter FROM employee_days_transmitters) AND variable = 'rpv'
'''
table2=pandasql.sqldf(query_text,globals())
You are using DISTINCT in the CTE, so I suspect you have duplicates for the combination of the columns employeeId, theDate, transmitterId, and this is why you get more than 800 rows.
You select 800 rows in the CTE, but when you use the operator IN in your main query, all the rows that satisfy your conditions are returned, which are more than 800.
But why do you use the CTE?
You could apply the conditions directly in the main query:
SELECT DISTINCT employeeId, theDate, transmitterId
FROM table1
WHERE variable='rpv'
ORDER BY RANDOM()
LIMIT 800
Or maybe with the ROW_NUMBER() window function:
WITH cte AS (
    SELECT id
    FROM (
        SELECT rowid id,
               ROW_NUMBER() OVER (PARTITION BY employeeId, theDate, transmitterId ORDER BY RANDOM()) rn
        FROM table1
        WHERE variable = 'rpv'
    )
    WHERE rn = 1
    ORDER BY RANDOM()
    LIMIT 800
)
SELECT *
FROM table1
WHERE rowid IN cte
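For completeness, a sketch of running such a query through pandasql as in the question (assuming table1 is a DataFrame in scope):
import pandasql

query_text = '''
SELECT DISTINCT employeeId, theDate, transmitterId
FROM table1
WHERE variable = 'rpv'
ORDER BY RANDOM()
LIMIT 800
'''

# sqldf looks up table1 in the namespace passed to it.
table2 = pandasql.sqldf(query_text, globals())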

ID tag is not auto incrementing the numbers in python

I am learning Python and trying to replicate what online tutorials do. I am trying to create a Python desktop app where data is stored in PostgreSQL. The code is added below,
`cur.execute("CREATE TABLE IF NOT EXISTS book (id INTEGER PRIMARY KEY, title text, author text, year integer, isbn integer)")`
The problem is with (id INTEGER PRIMARY KEY): when I execute the code it shows None in place of the first column, but I want it to show numbers.
Please help.
This is for Python 3.7.3, psycopg2==2.8.3,
def connect():
    conn = sqlite3.connect("books.db")
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS book (id INTEGER PRIMARY KEY, "
                "title text, author text, year integer, isbn integer)")
    conn.commit()
    conn.close()
The result I am expecting is an auto-incrementing number in the first column, whereas presently it shows None.
Below are the present and expected results:
None title author year isbn
1 title author year isbn
A CREATE statement does not produce any rows, so trying to print the result of that cursor.execute shows None. See below for an example.
Re Indexes :-
There will be no specific index, as column_name INTEGER PRIMARY KEY is special in that it defines the column as an alias of the rowid column, which is a special intrinsic index using the underlying B-tree storage engine.
When a row is inserted and no value is specified for the column (e.g. INSERT INTO book (title, author, year, isbn) VALUES ('book1','The Author','1999','1234567890')), id will be 1, and typically (but not certainly) the next row inserted will have an id of 2, and so on.
If after adding some rows you use SELECT * FROM book, the rows will be ordered according to the id, as no other index is specified/used.
Perhaps have a look at Rowid Tables.
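A quick hypothetical check of the aliasing, assuming the book table above; rowid and id come back identical because id is an alias of the rowid:
import sqlite3

conn = sqlite3.connect("books.db")
# rowid and id return the same values for an INTEGER PRIMARY KEY column
for row in conn.execute("SELECT rowid, id, title FROM book"):
    print(row)
conn.close()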
Example
Perhaps consider the following example :-
DROP TABLE IF EXISTS book;
CREATE TABLE IF NOT EXISTS book (id INTEGER PRIMARY KEY, title text, author text, year integer, isbn integer);
INSERT INTO book (title,author, year, isbn) VALUES
('book1','The Author','1999','1234567890'),
('book2','Author 2','1899','2234567890'),
('book3','Author 3','1799','3234567890')
;
INSERT INTO book VALUES (100,'book10','Author 10','1999','4234567890'); --<<<<<<<<<< specific ID
INSERT INTO book (title,author, year, isbn) VALUES
('book11','Author 11','1999','1234567890'),
('book12','Author 12','1899','2234567890'),
('book13','Author 13','1799','3234567890')
;
INSERT INTO book VALUES (10,'book10','Author 10','1999','4234567890'); --<<<<<<<<<< specific ID
SELECT * FROM book;
This :-
DROPs the book table (to make it easily re-runnable)
CREATEs the book table.
INSERTs 3 books with the id not specified (typical)
INSERTs a fourth book but with a specific id of 100
INSERTs another 3 books (note that these will be 101-103, as 100 is the highest id before the inserts)
INSERTs a last row BUT with a specific id of 10.
SELECTs all rows with all columns from the book table, ordered (as no ORDER BY has been specified) according to the hidden index based upon the id. NOTE: although id 10 was inserted last, it is the 4th row.
Result
In Python :-
conn = sqlite3.connect("books.db")
conn.execute("DROP TABLE IF EXISTS book")
conn.execute("CREATE TABLE IF NOT EXISTS book (id INTEGER PRIMARY KEY, title text, author text, year integer, isbn integer)")
conn.execute("INSERT INTO book (title, author, year, isbn) "
             "VALUES ('book1','The Author','1999','1234567890'), "
             "('book2','Author 2','1899','2234567890'), "
             "('book3','Author 3','1799','3234567890')")
conn.execute("INSERT INTO book VALUES (100,'book10','Author 10','1999','4234567890')")  # <<<<<<<<<< specific ID
conn.execute("INSERT INTO book (title, author, year, isbn) VALUES ('book11','Author 11','1999','1234567890'),('book12','Author 12','1899','2234567890'),('book13','Author 13','1799','3234567890')")
conn.execute("INSERT INTO book VALUES (10,'book10','Author 10','1999','4234567890')")  # <<<<<<<<<< specific ID
cur = conn.cursor()
cur.execute("SELECT * FROM book")
for each in cur:
    print("{0:<20} {1:<20} {2:<20} {3:<20} {4:<20}".format(each[0], each[1], each[2], each[3], each[4]))
conn.commit()
conn.close()
This results in :-
1 book1 The Author 1999 1234567890
2 book2 Author 2 1899 2234567890
3 book3 Author 3 1799 3234567890
10 book10 Author 10 1999 4234567890
100 book10 Author 10 1999 4234567890
101 book11 Author 11 1999 1234567890
102 book12 Author 12 1899 2234567890
103 book13 Author 13 1799 3234567890
Because you say "I am inserting the data manually", then instead of
def connect():
    conn = sqlite3.connect("books.db")
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS book (id INTEGER PRIMARY KEY, "
                "title text, author text, year integer, isbn integer)")
    conn.commit()
    conn.close()
Try to use
def connect():
    conn = sqlite3.connect("books.db")
    cur = conn.cursor()
    cur.execute("SELECT * FROM book")
    for row in cur:
        print(row[0], row[1], row[2], row[3], row[4])
    conn.close()
