How to query collection of objects while grouping by a property - python

So I am trying to write a query that will select a collection of objects that are distinct on a certain property.
Action:
+-----+-----------+-----+
| id | timestamp | ... |
+-----+-----------+-----+
| 10 | 16:04 | ... |
| 11 | 16:06 | ... |
| 12 | 16:08 | ... |
| 13 | 16:09 | ... |
| 14 | 16:10 | ... |
+-----+-----------+-----+
FooVersion:
+----+--------+-----------+-------------------+
| id | foo_id | action_id | foo_zab |
+----+--------+-----------+-------------------+
| 1 | 1 | 10 | xx |
| 2 | 2 | 11 | yy |
| 3 | 3 | 12 | zz |
| 4 | 3 | 13 | zy |
| 5 | 3 | 14 | zx |
+----+--------+-----------+-------------------+
Foo:
+----+-----+
| id | zab |
+----+-----+
| 1 | xx |
| 2 | yy |
| 3 | zx |
+----+-----+
A scene is made up of a collection of foos. I am trying to track the changes in each particular foo over time. Therefore, each time a change is made to a foo, the action that caused that change is recorded and a copy of some of the foo's properties is stored in the foo_versions table.
What I am looking for is the "state of the foos at a particular action". So, while action #14 only specifically links to one foo (foo #3), the state of the scene at action #14 actually contains 3 foos, the versions of which are foo_version #1, #2, and #5.
I need to construct a query that will say "for a specified action, give me the representation of the scene"
For action #10, the scene would be [<foo_version #1>]
For action #12, the scene would be [<foo_version #1>, <foo_version #2>, <foo_version #3>]
This is where it gets tricky. For action #14, the representation of the scene is [<foo_version #1>, <foo_version #2>, <foo_version #5>]. Foo versions #3, #4, and #5 all refer to the same foo. So, foo versions #3 and #4 are overwritten by #5.
I am using this sqlalchemy query:
stmt = db.session.query(Action).filter(Action.timestamp <= action.timestamp).subquery()
action_alias = aliased(Action, stmt)
foo_versions = db.session.query(FooVersion) \
.join(Action) \
.join(action_alias, FooVersion.action) \
.filter(Action.frame_id == frame.id) \
.all()
The result I am getting is:
[<foo_version #1>, <foo_version #2>, <foo_version #3>, <foo_version #4>, <foo_version #5>]
I need to get rid of foo versions #3 and #4, since they have been overwritten by #5.

I am not familiar with Python, but here is the SQL query, if I understood correctly what you want:
SELECT a.id AS action_id, timestamp,
       v.id AS version, foo_id, foo_zab
FROM action a
JOIN foo_version v
  ON ( v.action_id IN
       (
         SELECT MAX(inner_v.action_id)
         FROM foo_version inner_v
         WHERE a.id >= inner_v.action_id
         GROUP BY inner_v.foo_id
       )
     )
JOIN foo ON ( foo.id = v.foo_id )
ORDER BY a.id, version
It will give you each action row, repeated once for each foo version that is current at that action.
Here is a SQL fiddle demo.
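For anyone who wants to try the idea from Python without a full SQLAlchemy setup, here is a minimal sketch using the standard-library sqlite3 module. It loads the FooVersion rows from the question into an in-memory database and asks for the scene at one specified action. The helper name scene_at is hypothetical, and the sketch assumes, as the query above does, that action ids increase with time, so the highest action_id per foo is its latest version.

```python
import sqlite3

# In-memory copy of the question's FooVersion table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE foo_version "
    "(id INTEGER, foo_id INTEGER, action_id INTEGER, foo_zab TEXT)"
)
conn.executemany(
    "INSERT INTO foo_version VALUES (?, ?, ?, ?)",
    [
        (1, 1, 10, "xx"),
        (2, 2, 11, "yy"),
        (3, 3, 12, "zz"),
        (4, 3, 13, "zy"),
        (5, 3, 14, "zx"),
    ],
)


def scene_at(action_id):
    """Return the foo_version ids that make up the scene at the given action.

    Hypothetical helper: for every foo, keep only the version recorded by
    the latest action that is not newer than `action_id`.
    """
    rows = conn.execute(
        """
        SELECT v.id
        FROM foo_version v
        WHERE v.action_id IN (
            SELECT MAX(inner_v.action_id)
            FROM foo_version inner_v
            WHERE inner_v.action_id <= ?
            GROUP BY inner_v.foo_id
        )
        ORDER BY v.id
        """,
        (action_id,),
    ).fetchall()
    return [row[0] for row in rows]


print(scene_at(10))  # [1]
print(scene_at(12))  # [1, 2, 3]
print(scene_at(14))  # [1, 2, 5]
```

The same WHERE ... IN (SELECT MAX(...) GROUP BY foo_id) shape should be expressible as a SQLAlchemy subquery, but that translation is not shown here.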

Related

PySpark - How to group by rows and then map them using custom function

Let's say I have a table that looks like this:
| id | value_one | type | value_two |
|----|-----------|------|-----------|
| 1 | 2 | A | 1 |
| 1 | 4 | B | 1 |
| 2 | 3 | A | 2 |
| 2 | 1 | B | 3 |
I know that there are only types A and B for each ID. What I want to achieve is to group those two rows and calculate a new type C using the formula A/B, applied to both value_one and value_two, so that the table afterwards looks like this:
| id | value_one | type | value_two |
|----|-----------|------|-----------|
| 1 | 0.5 | C | 1 |
| 2 | 3 | C | 0.66 |
I am new to PySpark and so far have not been able to achieve the described result. I would appreciate any tips or solutions.
You can consider dividing the original dataframe into two parts according to type, and then use SQL statements to implement the calculation logic.
df.filter('type = "A"').createOrReplaceTempView('tmp1')
df.filter('type = "B"').createOrReplaceTempView('tmp2')
sql = """
select
tmp1.id
,tmp1.value_one / tmp2.value_one as value_one
,'C' as type
,tmp1.value_two / tmp2.value_two as value_two
from tmp1 join tmp2 using (id)
"""
result_df = spark.sql(sql)
result_df.show(truncate=False)
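If you want to sanity-check the join logic without a Spark session, the same idea can be tried with the standard-library sqlite3 module. This is only a sketch of the SQL, not PySpark code: the table name t and the aliases a and b are made up for the example, and the value columns are declared REAL because SQLite would otherwise do integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# REAL columns so that 2 / 4 gives 0.5 rather than SQLite's integer 0.
conn.execute(
    "CREATE TABLE t (id INTEGER, value_one REAL, type TEXT, value_two REAL)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [(1, 2, "A", 1), (1, 4, "B", 1), (2, 3, "A", 2), (2, 1, "B", 3)],
)

# Self-join: pair each id's A row with its B row and divide A by B.
ratio_rows = conn.execute(
    """
    SELECT a.id,
           a.value_one / b.value_one AS value_one,
           'C' AS type,
           a.value_two / b.value_two AS value_two
    FROM t AS a
    JOIN t AS b ON a.id = b.id
    WHERE a.type = 'A' AND b.type = 'B'
    ORDER BY a.id
    """
).fetchall()

for row in ratio_rows:
    print(row)
```

The two rows that come back match the expected table in the question, with 0.66 shown there being 2/3 truncated.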

Unstack (pivot?) dataframe in Pandas

I have a dataframe somewhat like this:
ID | Relationship | First Name | Last Name | DOB | Address | Phone
0 | 2 | Self | Vegeta | Saiyan | 01/01/1949 | Saiyan Planet | 123-456-7891
1 | 2 | Spouse | Bulma | Saiyan | 04/20/1969 | Saiyan Planet | 123-456-7891
2 | 3 | Self | Krilin | Human | 08/21/1992 | Planet Earth | 789-456-4321
3 | 4 | Self | Goku | Kakarot | 05/04/1975 | Planet Earth | 321-654-9870
4 | 4 | Child | Gohan | Kakarot | 04/02/2001 | Planet Earth | 321-654-9870
5 | 5 | Self | Freezer | Fridge | 09/15/1955 | Deep Space | 456-788-9568
I'm looking to have the rows with same ID appended to the right of the first row with that ID.
Example:
ID | Relationship | First Name | Last Name | DOB | Address | Phone | Spouse_First Name | Spouse_Last Name | Spouse_DOB | Child_First Name | Child_Last Name | Child_DOB |
0 | 2 | Self | Vegeta | Saiyan | 01/01/1949 | Saiyan Planet | 123-456-7891 | Bulma | Saiyan | 04/20/1969 | | |
1 | 3 | Self | Krilin | Human | 08/21/1992 | Planet Earth | 789-456-4321 | | | | | |
2 | 4 | Self | Goku | Kakarot | 05/04/1975 | Planet Earth | 321-654-9870 | | | | Gohan | Kakarot | 04/02/2001 |
3 | 5 | Self | Freezer | Fridge | 09/15/1955 | Deep Space | 456-788-9568 | | | | | |
My real dataframe has more columns, but they all hold the same information when two rows share the same ID, so there is no need to duplicate those in the other rows. I only need to add to the right the columns that I choose, which in this case would be First Name, Last Name and DOB, with the prefix of each new column label depending on the value in the 'Relationship' column (I can rename them later if that is not possible in a straightforward way; I just wanted to illustrate my point).
Having said that, I have tried different approaches, and it seems that unstack or pivot is the way to go, but I have not been successful in making it work.
Any help would be greatly appreciated.
This solution assumes that the DataFrame is indexed by the ID column.
import pandas as pd

not_self = (
    df.query("Relationship != 'Self'")
    .pivot(columns='Relationship')
    .swaplevel(axis=1)
    .reindex(
        pd.MultiIndex.from_product(
            (
                # sorted() turns the set into a list with a stable column order
                sorted(set(df['Relationship'].unique()) - {'Self'}),
                df.columns.to_series().drop('Relationship')
            )
        ),
        axis=1
    )
)
# flatten the (relationship, column) pairs into single labels
not_self.columns = [' '.join((a, b)) for a, b in not_self.columns]
result = df.query("Relationship == 'Self'").join(not_self)
Please let me know if this is not what was wanted.

Translating conditional RANK Window-Function from SQL to Pandas

I’m trying to translate a window function from SQL to Pandas that is applied only when a match is possible; otherwise a NULL (None) value is inserted.
SQL-Code (example)
SELECT
    [Customer].[ID_customer],
    [Customer].[cTimestamp],
    [TMP_Latest_request].[ID_req] AS [ID of Latest request]
FROM [table].[Customer] AS [Customer]
LEFT JOIN (
    SELECT * FROM (
        SELECT [ID_req], [ID_customer], [rTimestamp],
            RANK() OVER (PARTITION BY ID_customer ORDER BY rTimestamp DESC) AS rnk
        FROM [table].[Customer_request]
    ) AS [Q]
    WHERE rnk = 1
) AS [TMP_Latest_request]
ON [Customer].[ID_customer] = [TMP_Latest_request].[ID_customer]
Example
Joining the ID of the latest customer request (if exists) to the customer.
table:Customer
+-------------+------------+
| ID_customer | cTimestamp |
+-------------+------------+
| 1 | 2014 |
| 2 | 2014 |
| 3 | 2015 |
+-------------+------------+
table: Customer_request
+--------+-------------+------------+
| ID_req | ID_customer | rTimestamp |
+--------+-------------+------------+
| 1 | 1 | 2012 |
| 2 | 1 | 2013 |
| 3 | 1 | 2014 |
| 4 | 2 | 2014 |
+--------+-------------+------------+
Result: table:merged
+-------------+------------+----------------------+
| ID_customer | cTimestamp | ID of Latest request |
+-------------+------------+----------------------+
| 1 | 2014 | 3 |
| 2 | 2014 | 4 |
| 3 | 2015 | None/NULL |
+-------------+------------+----------------------+
What is the equivalent in Python Pandas?
Instead of using the RANK() function, you can simply use the query below, which is easier to convert. Note that it assumes ID_req increases with rTimestamp, so that the highest ID_req per customer is also the latest request.
SELECT A.ID_Customer, A.cTimeStamp, B.ID_req
FROM Customer A
LEFT JOIN (
    SELECT ID_Customer, MAX(ID_req) AS ID_req
    FROM Customer_request
    GROUP BY ID_Customer
) B
ON A.ID_Customer = B.ID_Customer
Try the query above; if you run into any issues, ask me in the comments.
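Since the question asks for the Pandas equivalent, here is one possible sketch. It rebuilds the two example tables as DataFrames (the variable names are my own), picks the row with the highest rTimestamp per customer, and left-merges it onto the customers. Unlike RANK(), idxmax keeps only one row per customer when timestamps tie, so treat this as an approximation of rnk = 1 for data without ties.

```python
import pandas as pd

# The two example tables from the question.
customer = pd.DataFrame(
    {"ID_customer": [1, 2, 3], "cTimestamp": [2014, 2014, 2015]}
)
customer_request = pd.DataFrame(
    {
        "ID_req": [1, 2, 3, 4],
        "ID_customer": [1, 1, 1, 2],
        "rTimestamp": [2012, 2013, 2014, 2014],
    }
)

# Row label of the most recent request for each customer.
latest_idx = customer_request.groupby("ID_customer")["rTimestamp"].idxmax()
latest = customer_request.loc[latest_idx, ["ID_customer", "ID_req"]]

# Left merge keeps customers without any request; their ID becomes NaN.
merged = customer.merge(latest, on="ID_customer", how="left").rename(
    columns={"ID_req": "ID of Latest request"}
)
print(merged)
```

Customer 3 has no request, so the merged column holds NaN for that row, which also turns the column into floats.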

Maximizing a combination of a series of values

This is a complicated one, but I suspect there's some principle I can apply to make it simple - I just don't know what it is.
I need to parcel out presentation slots to a class full of students for the semester. There are multiple possible dates, and multiple presentation types. I conducted a survey where students could rank their interest in the different topics. What I'd like to do is get the best (or at least a good) distribution of presentation slots to students.
So, what I have:
List of 12 dates
List of 18 students
CSV file where each student (row) has a rating 1-5 for each date
What I'd like to get:
Each student should have one of presentation type A (intro), one of presentation type B (figures) and 3 of presentation type C (aims)
Each date should have at least 1 of each type of presentation
Each date should have no more than 2 of type A or type B
Try to give students presentations that they rated highly (4 or 5)
I should note that I realize this looks like a homework problem, but it's real life :-). I was thinking that I might make a Student class for each student that contains the dates for each presentation type, but I wasn't sure what the best way to populate it would be. Actually, I'm not even sure where to start.
TL;DR: I think you're giving your students too much choice :D
But I had a shot at this problem anyway. Pretty fun exercise actually, although some of the constraints were a little vague. Most of all, I had to guess what the actual students' preference distribution would look like. I went with uniformly distributed, independent variables, although that's probably not very realistic. Still I think it should work just as well on real data as it does on my randomly generated data.
I considered brute forcing it, but a rough analysis gave me an estimate of over 10^65 possible configurations. That's kind of a lot. And since we don't have a trillion trillion years to consider all of them, we'll need a heuristic approach.
Because of the size of the problem, I tried to avoid doing any backtracking. But this meant that you could get stuck; there might not be a solution where everyone only gets dates they gave 4's and 5's.
I ended up implementing a double-edged Iterative Deepening-like search, where both the best case we're still holding out hope for (i.e., assign students to a date they gave a 5) and the worst case scenario we're willing to accept (some student might have to live with a 3) are gradually lowered until a solution is found. If we get stuck, reset, lower expectations, and try again. Tasks A and B are assigned first, and C is done only after A and B are complete, because the constraints on C are far less stringent.
I also used a weighting factor to model the trade-off between maximizing student happiness and satisfying the types-of-presentations-per-day limits.
Currently it seems to find a solution for pretty much every randomly generated set of preferences. I included an evaluation metric: the ratio between the sum of the preference values of all assigned student/date combos, and the sum of all students' ideal (top 3) preference values. For example, if student X had two fives, one four and the rest threes on his list, and is assigned to one of his fives and two threes, he gets 5+3+3=11 but could ideally have gotten 5+5+4=14; he is 11/14 = 78.6% satisfied.
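To make the metric concrete, here is a small stand-alone sketch of the per-student calculation. The function name satisfaction is just for illustration and is not taken from the full script; it compares the ratings of the assigned dates with the best ratings the student could have had for the same number of slots.

```python
def satisfaction(preferences, assigned):
    """Ratio of the assigned ratings to the best ratings that were possible.

    preferences: the student's rating (1-5) for every date
    assigned:    the ratings of the dates the student actually got
    """
    best_possible = sorted(preferences, reverse=True)[:len(assigned)]
    return sum(assigned) / sum(best_possible)


# Student X from the example: two fives, one four, the rest threes (12 dates).
student_x = [5, 5, 4] + [3] * 9
score = satisfaction(student_x, [5, 3, 3])
print(f"{score:.1%}")  # 78.6%
```

Summing the numerators and denominators over all students, instead of averaging the ratios, gives the overall figure described above.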
After some testing, it seems that my implementation tends to produce an average student satisfaction of around 95%, a lot better than I expected :) But again, that is with fake data. Real preferences are probably more clumped, and harder to satisfy.
Below is the core of the algorithm. The full script is ~250 lines and a bit too long for here, I think. Check it out on GitHub.
...
import random  # the excerpt also relies on students, dates, nStudents, etc. from the full script

# Assign a date for a given task to each student,
# preferring a date that they like and is still free.
def fill(task, lowest_acceptable, spread_weight=0.1, tasks_to_spread="ABC"):
    # randomize student order, so everyone has an equal chance to get their first pick
    # (list() is needed on Python 3, where a range cannot be shuffled in place)
    random_order = list(range(nStudents))
    random.shuffle(random_order)
    for i in random_order:
        student = students[i]
        if student.dates[task]:  # student is already assigned for this task?
            continue
        # get available dates ordered by preference and how fully booked they are
        preferred = get_favorite_day(student, lowest_acceptable,
                                     spread_weight, tasks_to_spread)
        for date_nr in preferred:
            date = dates[date_nr]
            if date.is_available(task, student.count, lowest_acceptable == 1):
                date.set_student(task, student.count)
                student.dates[task] = date
                break

# attempt to "fill()" the schedule while gradually lowering expectations
start_at = 5
while start_at > 1:
    lowest_acceptable = start_at
    while lowest_acceptable > 0:
        fill("A", lowest_acceptable, spread_weight, "AAB")
        fill("B", lowest_acceptable, spread_weight, "ABB")
        if lowest_acceptable == 1:
            fill("C", lowest_acceptable, spread_weight_C, "C")
        lowest_acceptable -= 1
    # assumed, not shown in the excerpt: the full script checks for a complete
    # schedule and resets here before lowering the best-case expectation
    start_at -= 1
And here is an example result as printed by the script:
Date
================================================================================
Student | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
================================================================================
1 | | A | B | | C | | | | | | | |
2 | | | | | A | | | | | B | C | |
3 | | | | | B | | | C | | A | | |
4 | | | | A | | C | | | | | | B |
5 | | | C | | | | A | B | | | | |
6 | | C | | | | | | | A | B | | |
7 | | | C | | | | | B | | | | A |
8 | | | A | | C | | B | | | | | |
9 | C | | | | | | | | A | | | B |
10 | A | B | | | | | | | C | | | |
11 | B | | | A | | C | | | | | | |
12 | | | | | | A | C | | | | B | |
13 | A | | | B | | | | | | | | C |
14 | | | | | B | | | | C | | A | |
15 | | | A | C | | B | | | | | | |
16 | | | | | | A | | | | C | B | |
17 | | A | | C | | | B | | | | | |
18 | | | | | | | C | A | B | | | |
================================================================================
Total student satisfaction: 250/261 = 95.00%

In mysql, is it possible to add a column based on values in one column?

I have a MySQL table, data, which has the following columns:
+-------+-----------+----------+
|a | b | c |
+-------+-----------+----------+
| John | 225630096 | 447 |
| John | 225630118 | 491 |
| John | 225630206 | 667 |
| John | 225630480 | 1215 |
| John | 225630677 | 1609 |
| John | 225631010 | 2275 |
| Ryan | 154247076 | 6235 |
| Ryan | 154247079 | 6241 |
| Ryan | 154247083 | 6249 |
| Ryan | 154247084 | 6251 |
+-------+-----------+----------+
I want to add a column d based on the values in a and c (see the expected table below). The values in a are the name of the subject, b is one of its attributes, and c is another. If the values of c for a subject are within 15 units of each other, assign them the same cluster number (for example, the values of c for Ryan are within 15 units of each other, so they are all assigned 1). If not, assign them a different value, as for John, where each row gets a different value for d.
+-------+-----------+----------+---+
|a | b | c |d |
+-------+-----------+----------+---+
| John | 225630096 | 447 | 1 |
| John | 225630118 | 491 | 2 |
| John | 225630206 | 667 | 3 |
| John | 225630480 | 1215 | 4 |
| John | 225630677 | 1609 | 5 |
| John | 225631010 | 2275 | 6 |
| Ryan | 154247076 | 6235 | 1 |
| Ryan | 154247079 | 6241 | 1 |
| Ryan | 154247083 | 6249 | 1 |
| Ryan | 154247084 | 6251 | 1 |
+-------+-----------+----------+---+
I am not sure if this can be done in MySQL, but if not, I would welcome any Python-based answers as well, in that case working on this table in CSV format.
Thanks.
You could use a query with variables:
SELECT a, b, c,
       CASE WHEN @last_a != a THEN @d:=1
            WHEN (@last_a = a) AND (c > @last_c+15) THEN @d:=@d+1
            ELSE @d END d,
       @last_a := a,
       @last_c := c
FROM
  tablename, (SELECT @d:=1, @last_a:=null, @last_c:=null) _n
ORDER BY a, c
Please see fiddle here.
Explanation
I'm using a join between tablename and the subquery (SELECT ...) _n just to initialize some variables (@d is initialized to 1, @last_a to null, @last_c to null).
Then, for every row, I'm checking if the last encountered a (the one on the previous row) is different from the current a: in that case set @d to 1 (and return it).
If the last encountered a is the same as the current row and c is greater than the last encountered c + 15, then increment @d and return its value.
Otherwise, just return @d without incrementing it. This will happen when a has not changed and c is not greater than the previous c + 15, or this will happen at the first row (because @last_a and @last_c have been initialized to null).
To make it work, we need to order by a and c.
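Since the question also welcomes Python answers, here is a rough equivalent of the same row-by-row logic in plain Python. The function name add_cluster_column and the max_gap parameter are made up for this sketch. The rows are hard-coded from the question; reading them from a CSV file with the csv module would work the same way once b and c are converted to integers.

```python
from itertools import groupby
from operator import itemgetter

# (a, b, c) rows from the question.
rows = [
    ("John", 225630096, 447),
    ("John", 225630118, 491),
    ("John", 225630206, 667),
    ("John", 225630480, 1215),
    ("John", 225630677, 1609),
    ("John", 225631010, 2275),
    ("Ryan", 154247076, 6235),
    ("Ryan", 154247079, 6241),
    ("Ryan", 154247083, 6249),
    ("Ryan", 154247084, 6251),
]


def add_cluster_column(rows, max_gap=15):
    """Append a cluster number d to each (a, b, c) row.

    Within each subject a, rows are walked in order of c; a new cluster
    starts whenever c exceeds the previous row's c by more than max_gap.
    """
    result = []
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))  # ORDER BY a, c
    for _, group in groupby(ordered, key=itemgetter(0)):
        cluster = 1
        previous_c = None
        for a, b, c in group:
            if previous_c is not None and c > previous_c + max_gap:
                cluster += 1
            result.append((a, b, c, cluster))
            previous_c = c
    return result


clustered = add_cluster_column(rows)
for row in clustered:
    print(row)
```

Like the query, this compares each row with the previous row rather than with the first row of the cluster, so a long chain of small steps stays in one cluster.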
