SQLite: How to group data according to custom rules - Python

I have a table as below.
If I do a GROUP BY on the name field, the key picked for b is 11, but the one I need to keep is 12, because 12 already appears in other records.
What should I do to achieve this result, without using the MAX aggregate?
To explain the meaning of the table:
key-12 provides both name-a and name-b, while key-11 only provides name-b.
For name-c, there are three keys that can provide name-c, none of which is repeated elsewhere.
|name|key|
|----|----|
| a | 12 |
| b | 11 |
| b | 12 |
| c | 15 |
| c | 14 |
| c | 17 |
...
What I hope to obtain through GROUP BY is:

|name|key|
|----|----|
| a | 12 |
| b | 12 |
| c | 15 |

When grouping by the name field, b has to keep key-12, because key-12 provides both name-a and name-b, so key-11 is not needed.
For name-c, there are three keys that can provide name-c, none of which is repeated, so we just use the one that appears first.

Assuming that key is unique for each name, you can use the COUNT() and ROW_NUMBER() window functions:
select name, key
from (
    select *, row_number() over (partition by name order by counter desc, rowid) rn
    from (
        -- counter = number of rows that share this key across all names
        select *, rowid, count(*) over (partition by key) counter
        from tablename
    )
)
where rn = 1
Results:
|name|key|
|----|----|
| a | 12 |
| b | 12 |
| c | 15 |
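Since the question is tagged python, here is a minimal sketch of running that query with the built-in sqlite3 module, using the table and column names from the question (window functions require SQLite 3.25 or newer):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table tablename (name text, key integer)")
conn.executemany(
    "insert into tablename (name, key) values (?, ?)",
    [("a", 12), ("b", 11), ("b", 12), ("c", 15), ("c", 14), ("c", 17)],
)

# Per name, keep the key shared by the most rows; ties go to the earliest rowid
rows = conn.execute("""
    select name, key
    from (
        select *, row_number() over (partition by name order by counter desc, rowid) rn
        from (
            select *, rowid, count(*) over (partition by key) counter
            from tablename
        )
    )
    where rn = 1
""").fetchall()
print(rows)  # expected: [('a', 12), ('b', 12), ('c', 15)]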

Related

PySpark - How to group by rows and then map them using custom function

Let's say I have a table that looks like this:
| id | value_one | type | value_two |
|----|-----------|------|-----------|
| 1 | 2 | A | 1 |
| 1 | 4 | B | 1 |
| 2 | 3 | A | 2 |
| 2 | 1 | B | 3 |
I know that there are only A and B types for a given ID. What I want to achieve is to group those two rows and calculate a new type using the formula A/B; it should be applied to value_one and value_two, so the table afterwards should look like:
| id | value_one | type | value_two|
|----|-----------| -----|----------|
| 1 | 0.5 | C | 1 |
| 2 | 3 | C | 0.66 |
I am new to PySpark, and so far I haven't been able to achieve the described result. I would appreciate any tips/solutions.
You can divide the original dataframe into two parts according to type, and then use a SQL statement to implement the calculation logic.
df.filter('type = "A"').createOrReplaceTempView('tmp1')
df.filter('type = "B"').createOrReplaceTempView('tmp2')

sql = """
select
    tmp1.id,
    tmp1.value_one / tmp2.value_one as value_one,
    'C' as type,
    tmp1.value_two / tmp2.value_two as value_two
from tmp1 join tmp2 using (id)
"""

result_df = spark.sql(sql)
result_df.show(truncate=False)
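If you'd rather stay in the DataFrame API, a self-join does the same thing. A minimal sketch, assuming a SparkSession named spark and the sample data from the question:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 2, "A", 1), (1, 4, "B", 1), (2, 3, "A", 2), (2, 1, "B", 3)],
    ["id", "value_one", "type", "value_two"],
)

# Split by type, alias the two sides, then join on id and divide A by B
a = df.filter(F.col("type") == "A").alias("a")
b = df.filter(F.col("type") == "B").alias("b")

result = a.join(b, on="id").select(
    "id",
    (F.col("a.value_one") / F.col("b.value_one")).alias("value_one"),
    F.lit("C").alias("type"),
    (F.col("a.value_two") / F.col("b.value_two")).alias("value_two"),
)
result.show()  # id=1 -> 0.5 / C / 1.0, id=2 -> 3.0 / C / 0.666...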

Django queryset - Add HAVING constraint after annotate(F())

I ran into a seemingly ordinary situation when adding HAVING to a query.
I read here and here, but it did not help me.
I need to add HAVING to my query.
Models:

class Package(models.Model):
    status = models.IntegerField()

class Product(models.Model):
    title = models.CharField(max_length=10)
    packages = models.ManyToManyField(Package)
Products:
|id|title|
| - | - |
| 1 | A |
| 2 | B |
| 3 | C |
| 4 | D |
Packages:
|id|status|
| - | - |
| 1 | 1 |
| 2 | 2 |
| 3 | 1 |
| 4 | 2 |
Product_Packages:
|product_id|package_id|
| - | - |
| 1 | 1 |
| 2 | 1 |
| 2 | 2 |
| 3 | 2 |
| 2 | 3 |
| 4 | 3 |
| 4 | 4 |
Visually:
pack_1 (A, B) status OK
pack_2 (B, C) status not OK
pack_3 (B, D) status OK
pack_4 (D) status not OK
My task is to select those products whose latest package has status = 1.
The expected result is: A, B.
My query is like this:
SELECT prod.title, max(pack.id)
FROM "product" as prod
INNER JOIN "product_packages" as p_p ON (prod.id = p_p.product_id)
INNER JOIN "package" as pack ON (pack.id = p_p.package_id)
GROUP BY prod.title
HAVING pack.status = 1
It returns exactly what I need:
|title|max(pack.id)|
| - | - |
| A | 1 |
| B | 3 |
BUT my ORM does not work correctly.
I tried like this:

p = (
    Product.objects
    .values('id')
    .annotate(pack_id=Max('packages'))
    .annotate(my_status=F('packages__status'))
    .filter(my_status=1)
    .values('id', 'pack_id')
)

p.query gives:
SELECT "product"."id", MAX("product_packages"."package_id") AS "pack_id"
FROM "product" LEFT OUTER JOIN "product_packages" ON ("product"."id" = "product_packages"."product_id") LEFT OUTER JOIN "package" ON ("product_packages"."package_id" = "package"."id")
WHERE "package"."status" = 1
GROUP BY "product"."id"
Please help me write the correct ORM query.
What does the query look like when you remove
.values('id', 'pack_id')
at the end?
If I remember correctly:

p = Product.objects.values('id').annotate(pack_id=Max('packages')).annotate(my_status=F('packages__status')).filter(my_status=1)

and

p = Product.objects.annotate(pack_id=Max('packages')).annotate(my_status=F('packages__status')).filter(my_status=1).values('id')

will result in different queries.
Haki Benita has an excellent site for database gurus on how to make the most of Django.
You can take a look at this post:
https://hakibenita.com/django-group-by-sql#how-to-use-having
Django has a very specific way of producing the HAVING operator: your queryset needs to be structured so that a values() call singles out the column you want to group by, followed by an annotate() with Max or whatever aggregate you want.
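A minimal sketch of that pattern against the models above (the threshold is purely illustrative); filtering on the aggregate alias afterwards is what becomes HAVING:

from django.db.models import Max

# values() before annotate() -> GROUP BY "title"
# filter() on the aggregate alias -> HAVING MAX(package_id) > 1
Product.objects.values('title').annotate(
    max_pack=Max('packages')
).filter(max_pack__gt=1)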
Also, this annotation seems like it won't work: annotate(my_status=F('packages__status')) tries to map multiple statuses onto a single annotation.
You might want to try a subquery to annotate the way you want.
e.g.
Product.objects.annotate(
    latest_pack_id=Subquery(
        Package.objects.order_by('-pk').filter(status=1).values('pk')[:1]
    )
).filter(
    packages__in=F('latest_pack_id')
)
Or something along those lines; I haven't tested this out.
I think you can try like this with a subquery:
from django.db.models import OuterRef, Subquery
sub_query = Package.objects.filter(product=OuterRef('pk')).order_by('-pk')
products = Product.objects.annotate(
    latest_package_status=Subquery(sub_query.values('status')[:1])
).filter(latest_package_status=1)
Here I first prepare the subquery by filtering the Package model on the Product's primary key and ordering by the Package's primary key, descending. Then I take the latest status from the subquery, annotate it onto the Product queryset, and filter on status 1.
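To sanity-check this against the sample data from the question (a minimal sketch, assuming the Package/Product models above are migrated; printing queryset.query is a handy way to inspect the generated SQL):

from django.db.models import OuterRef, Subquery

# Reproduce the sample data
p1 = Package.objects.create(id=1, status=1)
p2 = Package.objects.create(id=2, status=2)
p3 = Package.objects.create(id=3, status=1)
p4 = Package.objects.create(id=4, status=2)
a = Product.objects.create(id=1, title='A'); a.packages.add(p1)
b = Product.objects.create(id=2, title='B'); b.packages.add(p1, p2, p3)
c = Product.objects.create(id=3, title='C'); c.packages.add(p2)
d = Product.objects.create(id=4, title='D'); d.packages.add(p3, p4)

# Latest package per product, correlated on the product's pk
sub_query = Package.objects.filter(product=OuterRef('pk')).order_by('-pk')
products = Product.objects.annotate(
    latest_package_status=Subquery(sub_query.values('status')[:1])
).filter(latest_package_status=1)

print(products.query)  # inspect the generated SQL
print(sorted(products.values_list('title', flat=True)))  # expected: ['A', 'B']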

Translating conditional RANK Window-Function from SQL to Pandas

I'm trying to translate a window function from SQL to Pandas. It is only applied under the condition that a match is possible; otherwise a NULL (None) value is inserted.
SQL code (example):
SELECT
    [ID_customer],
    [cTimestamp],
    [TMP_Latest_request].[ID_req] AS [ID of Latest request]
FROM [table].[Customer] AS [Customer]
LEFT JOIN (
    SELECT * FROM (
        SELECT [ID_req], [ID_customer], [rTimestamp],
               RANK() OVER (PARTITION BY ID_customer ORDER BY rTimestamp DESC) AS rnk
        FROM [table].[Customer_request]
    ) AS [Q]
    WHERE rnk = 1
) AS [TMP_Latest_request]
ON [Customer].[ID_customer] = [TMP_Latest_request].[ID_customer]
Example
Joining the ID of the latest customer request (if one exists) to the customer.
table:Customer
+-------------+------------+
| ID_customer | cTimestamp |
+-------------+------------+
| 1 | 2014 |
| 2 | 2014 |
| 3 | 2015 |
+-------------+------------+
table: Customer_request
+--------+-------------+------------+
| ID_req | ID_customer | rTimestamp |
+--------+-------------+------------+
| 1 | 1 | 2012 |
| 2 | 1 | 2013 |
| 3 | 1 | 2014 |
| 4 | 2 | 2014 |
+--------+-------------+------------+
Result: table:merged
+-------------+------------+----------------------+
| ID_customer | cTimestamp | ID of Latest request |
+-------------+------------+----------------------+
| 1 | 2014 | 3 |
| 2 | 2014 | 4 |
| 3 | 2015 | None/NULL |
+-------------+------------+----------------------+
What is the equivalent in Python Pandas?
Instead of using the RANK() function, you can simply use the query below, and it is easy to convert:
SELECT A.ID_Customer, A.cTimeStamp, B.ID_req
FROM Customer A
LEFT JOIN (
    SELECT ID_Customer, MAX(ID_req) AS ID_req
    FROM Customer_request
    GROUP BY ID_Customer
) B
ON A.ID_Customer = B.ID_Customer
Try the query; if you are facing any issues, ask me in the comments.
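Since the question asks for the Pandas equivalent, here is a minimal sketch of the same idea (take the latest request per customer, then left-join onto the customers), using the sample data from the question:

import pandas as pd

customer = pd.DataFrame({"ID_customer": [1, 2, 3], "cTimestamp": [2014, 2014, 2015]})
customer_request = pd.DataFrame({
    "ID_req": [1, 2, 3, 4],
    "ID_customer": [1, 1, 1, 2],
    "rTimestamp": [2012, 2013, 2014, 2014],
})

# Latest request per customer: sort by rTimestamp, keep the last row per ID_customer
latest = (customer_request
          .sort_values("rTimestamp")
          .drop_duplicates("ID_customer", keep="last")
          .rename(columns={"ID_req": "ID of Latest request"})
          [["ID_customer", "ID of Latest request"]])

# The left join leaves NaN (Pandas' NULL) for customers without requests
merged = customer.merge(latest, on="ID_customer", how="left")
print(merged)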

Extract MIN MAX of SQL table in python

I have a db with the following info:
+----+------+-----+-----+------+------+-----------+
| id | time | lat | lon | dep | dest | reg |
+----+------+-----+-----+------+------+-----------+
| a | 1 | 10 | 20 | home | work | alpha |
| a | 6 | 11 | 21 | home | work | alpha |
| a | 11 | 12 | 22 | home | work | alpha |
| b | 2 | 70 | 80 | home | cine | beta |
| b | 8 | 70 | 85 | home | cine | beta |
| b | 13 | 70 | 90 | home | cine | beta |
+----+------+-----+-----+------+------+-----------+
Is it possible to extract the following info:
+----+------+------+----------+----------+----------+----------+------+------+--------+
| id | tmin | tmax | lat_tmin | lon_tmin | lat_tmax | lon_tmax | dep | dest | reg |
+----+------+------+----------+----------+----------+----------+------+------+--------+
| a | 1 | 11 | 10 | 20 | 12 | 22 | home | work | alpha |
| b | 2 | 13 | 70 | 80 | 70 | 90 | home | cine | beta |
+----+------+------+----------+----------+----------+----------+------+------+--------+
What if dep & dest were varying - how would you select them?
Thanks
You could use a window function for that:
SELECT t0.id,
       t0.time as tmin, t1.time as tmax,
       t0.lat as lat_tmin, t1.lat as lat_tmax,
       t0.lon as lon_tmin, t1.lon as lon_tmax,
       t0.dep,
       t0.dest,
       t0.reg
FROM (SELECT *,
             row_number() over (partition by id order by time asc) as rn
      FROM t) AS t0
INNER JOIN (SELECT *,
                   row_number() over (partition by id order by time desc) as rn
            FROM t) AS t1
ON t0.id = t1.id
WHERE t0.rn = 1
AND t1.rn = 1
This will return the data from the first and last row for each id, when sorted by id, time.
The values for dep, dest and reg are taken from the first row (per id) only.
If you also want separate rows when, for the same id, there are different values of dep or dest, then just add those columns to the partition by clauses. It all depends on which output you expect in that case:
SELECT t0.id,
       t0.time as tmin, t1.time as tmax,
       t0.lat as lat_tmin, t1.lat as lat_tmax,
       t0.lon as lon_tmin, t1.lon as lon_tmax,
       t0.dep,
       t0.dest,
       t0.reg
FROM (SELECT *,
             row_number() over (partition by id, dep, dest, reg
                                order by time asc) as rn
      FROM t) AS t0
INNER JOIN (SELECT *,
                   row_number() over (partition by id, dep, dest, reg
                                      order by time desc) as rn
            FROM t) AS t1
ON t0.id = t1.id
WHERE t0.rn = 1
AND t1.rn = 1
Please consider using a different name for the column time because of this remark in the documentation:
The following keywords could be reserved in future releases of SQL Server as new features are implemented. Consider avoiding the use of these words as identifiers.
... TIME ...
select min(t.time) as tmin, max(t.time) as tmax, (...) from table_name t group by t.id
(...) just repeat for the other columns
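If the extraction should happen on the Python side, the same first/last-row-per-id idea is a few lines of Pandas. A minimal sketch, assuming the table has been read into a dataframe df (e.g. with pd.read_sql):

import pandas as pd

df = pd.DataFrame({
    "id":   ["a", "a", "a", "b", "b", "b"],
    "time": [1, 6, 11, 2, 8, 13],
    "lat":  [10, 11, 12, 70, 70, 70],
    "lon":  [20, 21, 22, 80, 85, 90],
    "dep":  ["home"] * 6,
    "dest": ["work"] * 3 + ["cine"] * 3,
    "reg":  ["alpha"] * 3 + ["beta"] * 3,
})

# Sort by time within each id, then grab the first and last row per id
df = df.sort_values(["id", "time"])
first = df.groupby("id").first()  # values at tmin
last = df.groupby("id").last()    # values at tmax

out = pd.DataFrame({
    "tmin": first["time"], "tmax": last["time"],
    "lat_tmin": first["lat"], "lon_tmin": first["lon"],
    "lat_tmax": last["lat"], "lon_tmax": last["lon"],
    "dep": first["dep"], "dest": first["dest"], "reg": first["reg"],
}).reset_index()
print(out)

If dep and dest vary within an id, group by ["id", "dep", "dest", "reg"] instead, mirroring the extended partition by above.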

Finding duplicates sets of records sharing the same Many-To-Many relations

I use Pandas to pre-process a CSV dataset and convert it to a SQLite database.
I have a many-to-many relation between two entities A and B, represented by a junction DataFrame A2B with A2B.columns == ['AId', 'BId']. The uniqueness constraint on the As is that each A must have a different set of relations to the Bs.
I want to efficiently remove duplicate As based on this constraint. I do it with Pandas like this:
AId_dedup = A2B.groupby('AId').BId.apply(tuple).drop_duplicates().index
The conversion to tuples allows comparing the BId collections related to each AId.
The relation A2B can be seen as a (sparse boolean) matrix, with 1s wherever a link exists between A and B. I want to remove duplicated rows of this matrix; alas, pd.unstack() can't generate sparse matrices. (It also requires efficient hashing of rows.)
My questions are:
What am I trying to do, in terms of relational algebra?
Can it be done more efficiently with Pandas or SQL (preferably with the SQLite engine)?
I want to use this operation to find synonyms (duplicate objects) in biological networks where the interactions are represented as tables.
Edit : Here is an example of what I want:
+-----+-----+
| Aid | Bid |
+-----+-----+
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 1 |
| 2 | 2 |
| 2 | 3 |
| 3 | 1 |
| 3 | 2 |
| 3 | 3 |
| 3 | 4 |
+-----+-----+
A2B = A2B.groupby('AId').BId.apply(tuple)
+-----+-----------+
| Aid | Bid |
+-----+-----------+
| 1 | (1,2,3) |
| 2 | (1,2,3) |
| 3 | (1,2,3,4) |
+-----+-----------+
A2B = A2B.drop_duplicates()
+-----+-----------+
| Aid | Bid |
+-----+-----------+
| 1 | (1,2,3) |
| 3 | (1,2,3,4) |
+-----+-----------+
Back to the junction table (not that easy in Pandas):
+-----+-----+
| Aid | Bid |
+-----+-----+
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 3 | 1 |
| 3 | 2 |
| 3 | 3 |
| 3 | 4 |
+-----+-----+
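For the record, the whole round trip is doable in Pandas; a minimal sketch using the example data above (sorting first makes the tuple signature independent of row order):

import pandas as pd

A2B = pd.DataFrame({
    "AId": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    "BId": [1, 2, 3, 1, 2, 3, 1, 2, 3, 4],
})

# One hashable signature per AId: the sorted tuple of its BIds
sig = A2B.sort_values("BId").groupby("AId").BId.apply(tuple)

# Keep the first AId of each distinct signature
keep = sig.drop_duplicates().index

# Back to the junction table: filter the original long format
A2B_dedup = A2B[A2B.AId.isin(keep)]
print(A2B_dedup)  # rows for AId 1 and 3 only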
If you can recreate your A2B table: create a new one with a unique constraint on ('AId', 'BId'), and then insert the data like this:
insert into new_A2B select distinct AId, BId from A2B;
Then do new inserts in SQLite with an ON CONFLICT clause, like this:
insert or ignore into new_A2B values (aid, bid);
If you can't recreate your A2B table, then use DISTINCT when you select rows from it.
Edit:
You can find duplicate aids with this query:
select A2B.aid, dup.aid
from A2B
left join A2B as dup on dup.bid = A2B.bid
group by A2B.aid, dup.aid
having count(A2B.bid) = count(dup.bid)
and count(A2B.bid) = (select count(bid) from A2B where aid = dup.aid)
If you need, you can add a where condition to find duplicates only for the lower aid, like this:
where A2B.aid < dup.aid
Maybe this query will be faster:
with c as (
    select aid, count(1) as c
    from A2B
    group by aid
)
select A2B.aid, dup.aid
from A2B
inner join c as ac on ac.aid = A2B.aid
left join A2B as dup on dup.bid = A2B.bid and A2B.aid < dup.aid
    and exists (select 1 from c where aid = dup.aid and c = ac.c)
group by A2B.aid, dup.aid
having count(A2B.bid) = count(dup.bid)
   and count(A2B.bid) = (select count(bid) from A2B where aid = dup.aid)
Edit:
There is one more solution that you can test (it is possible that this is the fastest query):
with c as (
    select aid, min(bid) as f, max(bid) as l, count(1) as c
    -- , sum(bid) as s
    from A2B
    group by aid
)
select f.aid, dup.aid
from c as f
inner join c as dup
    on f.aid < dup.aid and f.f = dup.f and f.l = dup.l and f.c = dup.c
    -- and f.s = dup.s
where f.c = (
    select count(1)
    from A2B as t1
    inner join A2B as t2
        on t1.aid < t2.aid and t1.bid = t2.bid
       and t1.aid = f.aid and t2.aid = dup.aid
)
You can also try uncommenting sum(bid) as s together with the matching and f.s = dup.s.
