I know a MySQL database is not stored in any particular order, but I need it to behave in an ordered way. I have a table, example below, which currently appears to be in the order of column 1 (jobID, auto increment, primary key). What I need to do is change the order by, for example, moving the 3rd row up one position, essentially swapping the positions of the 3rd and 2nd rows, but I am unsure of how to do this. The reason is that I am accessing this database via a Python app which grabs jobs from a list, and I sometimes need to change the priority order. What would be the best way to do this, please?
+-------+----------+--------+---------+--------+---------------------+
| jobID | location | mode   | process | status | submitTime          |
+-------+----------+--------+---------+--------+---------------------+
|     1 | /let111/ | Verify |       1 | Failed | 2014-06-25 12:24:38 |
|     2 | /let114/ | Verify |       1 | Passed | 2014-06-25 12:37:31 |
|     3 | /let112/ | Verify |       1 | Failed | 2014-06-25 14:48:55 |
|     4 | /let117/ | Verify |       2 | Passed | 2014-06-25 14:49:01 |
|     5 | /let113/ | Verify |       2 | Passed | 2014-06-25 14:49:13 |
+-------+----------+--------+---------+--------+---------------------+
If you want to order by status first and then time, use an order by clause:
select t.*
from table t
order by status, submitTime;
Tables in relational databases in general, and MySQL in particular, have no inherent order because a table models an unordered set of tuples. You can get results back in a particular order by using order by.
Note that the above is just one possibility that accomplishes what you are asking for. What you really want might involve other columns.
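For completeness, a minimal sketch of how the Python app might fetch jobs in that order, assuming the mysql-connector-python driver; the table name jobs and the connection details are placeholders:

import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="app", password="secret", database="jobsdb"
)
cur = conn.cursor()
# The ordering lives in the query, not in the table itself.
cur.execute(
    "SELECT jobID, location, mode, process, status, submitTime "
    "FROM jobs ORDER BY status, submitTime"
)
for row in cur.fetchall():
    print(row)
conn.close()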
I have a database stored in a GridDB container. I need to filter on a column (of codes) and get all the rows whose value contains "INB" in its code.
Like this:
'''
 id | codes
----+---------------
  0 | UIGTF0941H9RBD
  1 | UIHGG83G31H9G3
  2 | UIFH3442N2B9HD
  3 | UI41BBINB2B52O
  4 | UI20JUINBHN52N
  5 | UI4207HRHIHCBC

Result:

 id | codes
----+---------------
  3 | UI41BBINB2B52O
  4 | UI20JUINBHN52N
'''
I have already checked the SQL documentation on the official website for GridDB; I saw that the HAVING clause could be used, but I have no idea how to use it.
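For substring matching like this, a LIKE condition (rather than HAVING) is usually what is needed. A minimal sketch with the griddb_python client, where the connection settings and the container name codes_container are placeholder assumptions:

import griddb_python as griddb

factory = griddb.StoreFactory.get_instance()
# Connection settings below are placeholders for your own cluster.
store = factory.get_store(
    notification_member="127.0.0.1:10001",
    cluster_name="myCluster",
    username="admin",
    password="admin",
)
container = store.get_container("codes_container")  # hypothetical container name
# Keep only rows whose codes column contains the substring "INB".
query = container.query("SELECT * WHERE codes LIKE '%INB%'")
rows = query.fetch()
while rows.has_next():
    print(rows.next())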
I am trying to annotate a queryset with an aggregate of a subset of previous rows. Take the following example table of a player's score in a particular game, with the column last_2_average_score being the rolling average of that player's scores from the previous two games.
+----------+-----------+---------+-------------------------+
| date | player | score | last_2_average_score |
+----------+-----------+---------+-------------------------+
| 12/01/19 | 1 | 10 | None |
| 12/02/19 | 1 | 9 | None |
| 12/03/19 | 1 | 8 | 9.5 |
| 12/04/19 | 1 | 7 | 8.5 |
| 12/05/19 | 1 | 6 | 7.5 |
+----------+-----------+---------+-------------------------+
In order to accomplish this, I wrote the following query, trying to annotate each "row" with the corresponding two-game average for their score:
ScoreModel.objects.annotate(
    last_two_average_score=Subquery(
        ScoreModel.objects.filter(
            player=OuterRef("player"), date__lt=OuterRef("date")
        )
        .order_by("-date")[:2]
        .annotate(Avg("score"))
        .values("score__avg")[:1],
        output_field=FloatField(),
    )
)
This query, however, does not output the correct result. In fact, the result is just every record annotated with
{'last_two_average_score': None}
I have tried a variety of different combinations of the query, and cannot find the correct combination. Any advice that you can give would be much appreciated!
Instead of trying to address the problem from the ORM first, I ended up circling back and implementing the query in raw SQL. This immediately led me to the concept of WINDOW functions, which I then found very quickly in Django's ORM.
https://docs.djangoproject.com/en/3.0/ref/models/expressions/#window-functions
For those interested, the resulting query looks something like this, which was much simpler than what I was trying to accomplish with Subquery:
from django.db.models import Avg, F, RowRange, Window

ScoreModel.objects.annotate(
    last_two_average=Window(
        expression=Avg("score"),
        partition_by=[F("player")],
        order_by=[F("date").desc()],
        frame=RowRange(start=-2, end=0),
    )
)
Referring to here, which recommends a join to append a column from one table to another: I have indeed been using this method, but I am now reaching its limits for a huge list of tables and rows.
Let's say I have a dataframe of M features: id, salary, age, etc.
+----+--------+------------+--------------+
| id | salary | age | zone | ....
+----+--------+------------+--------------+
I have performed certain operations on each feature to arrive at something like this:
+----+--------+------------+--------------+------------+--------------+--------------+--------------+
| id | salary | bin_salary | start_salary | end_salary | count_salary | stat1_salary | stat2_salary |
+----+--------+------------+--------------+------------+--------------+--------------+--------------+
Each feature is processed independently, with the same list of rows
+-----+--------+--------------+--------------+--------------+
| id  | salary | stat1_salary | stat2_salary | stat3_salary |
+-----+--------+--------------+--------------+--------------+
| 301 | x1     | x            | x            | x            |
| 302 | null   | x            | x            | x            |
| 303 | x3     | x            | x            | x            |
+-----+--------+--------------+--------------+--------------+
+-----+------+-----------+-----------+
| id  | age  | stat1_age | stat2_age |
+-----+------+-----------+-----------+
| 301 | null | x         | x         |
| 302 | x2   | x         | x         |
| 303 | x3   | x         | x         |
+-----+------+-----------+-----------+
In the end, I would like to combine them into a final dataframe with all attributes of each feature, by joining on the unique id across effectively hundreds to thousands of tables, one per feature. This final dataframe is my feature vector:
| id | salary | stat1_salary | stat2_salary | stat3_salary | age | stat1_age | stat2_age |
I am hitting a memory limit that causes an Out Of Memory exception. Raising executor and driver memory seems to be only a temporary solution, and it is limited by the admin.
JOIN is expensive and resource-limited in PySpark, and I wonder if it's possible to pre-sort each feature table independently, keep that order, and just APPEND the entire columns next to one another instead of performing an expensive JOIN. I can manage to keep the same list of rows for each feature table. I hope to have no join nor lookup because my set of ids is the same.
How is this achievable? As far as I understand, even if I sort each table by id, Spark distributes them for storage, and retrieval (if I want to query back to append) does not guarantee that same order.
There doesn't seem to be a Spark function to append a column from one DF to another directly, except 'join'.
If you are starting from only one dataframe and trying to generate new features from each original column of the dataframe, I would suggest using a 'pandas_udf', where the new features are appended inside the 'udf' for all the original columns. This avoids using 'join' at all.
To control the memory usage, choose the 'group' column so that each group stays within the executor memory specification.
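For illustration, a minimal sketch of that idea using a grouped pandas UDF via applyInPandas (PySpark 3.x); the dataframe df, the 'group' column, and the stat computations here are placeholder assumptions:

import pandas as pd

def add_feature_stats(pdf: pd.DataFrame) -> pd.DataFrame:
    # All rows of one group arrive as a single pandas DataFrame,
    # so the new stat columns can be attached without any Spark join.
    for col in ("salary", "age"):
        pdf["stat1_" + col] = pdf[col].rank()
        pdf["stat2_" + col] = pdf[col].fillna(0).cumsum()
    return pdf

# The output schema must be declared up front.
out_schema = ("id long, salary double, age double, group int, "
              "stat1_salary double, stat2_salary double, "
              "stat1_age double, stat2_age double")

feature_vector = df.groupBy("group").applyInPandas(add_feature_stats, schema=out_schema)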
I am using a Django query to filter out some transactions from a table where one transaction might have multiple entries.
E.g. sample table:
+---------------+---------+
| TransactionId | Status |
+---------------+---------+
| Txn0 | Pending |
| Txn0 | Success |
| Txn1 | Fail |
| Txn2 | Pending |
| Txn3 | Fail |
| Txn4 | Pending |
| Txn4 | Fail |
| Txn5 | Pending |
+---------------+---------+
Current query:
SampleTable.objects.exclude(status='Fail').exclude(status='Success')
My current query returns Txn0, Txn2, Txn4, Txn5 (because these are marked Pending).
I need a queryset which returns only Txn2 and Txn5 (because every other transaction has at least one Fail or Success entry).
I also tried using .distinct(), but that didn't help.
Well, not a straightforward one. But first, filter all the transactions (rows) with a Fail/Success status:
success_fail_txns = (SampleTable.objects.filter(status='Success')
                     | SampleTable.objects.filter(status='Fail'))
# Exclude by transaction id, not by the default primary key.
pending_txns = SampleTable.objects.exclude(txnid__in=success_fail_txns.values('txnid'))
You can use exclude with an __in condition:
SampleTable.objects.exclude(status__in=['Fail', 'Success'])
Let's assume I have the following session data split over two workers:
+---------+-----------+--------+
| user_id | timestamp | worker |
+---------+-----------+--------+
| 1 | 2 | 1 |
| 1 | 1 | 1 |
| 1 | 21 | 1 |
| 1 | 23 | 2 |
| 1 | 24 | 2 |
+---------+-----------+--------+
I want to make sure that if I aggregate the data, I will get the following result:
+---------+---------------+-------------+---------------+
| user_id | session_start | session_end | element_count |
+---------+---------------+-------------+---------------+
| 1 | 1 | 2 | 2 |
| 1 | 21 | 24 | 3 |
+---------+---------------+-------------+---------------+
And not the following result:
+---------+---------------+-------------+---------------+
| user_id | session_start | session_end | element_count |
+---------+---------------+-------------+---------------+
| 1 | 1 | 2 | 2 |
| 1 | 21 | 21 | 1 |
| 1 | 23 | 24 | 2 |
+---------+---------------+-------------+---------------+
For example, in this use case I would have to partition the input by user id so that all the session data for a user is available before splitting it into session windows.
My questions would be:
Does Apache Beam automatically make sure that each worker has all the data?
For example, does this code snippet make sure that all user sessions are aggregated correctly, even if the data was split over multiple workers?
import apache_beam as beam
from apache_beam.transforms.window import Sessions

(pcoll
 | beam.WindowInto(Sessions(gap_size=10))
 | beam.GroupByKey())
If so, how exactly does it guarantee this behaviour?
How can I replicate this specific situation locally so I can test it?
Where can I read more about this behaviour in the documentation?
Does Apache Beam automatically make sure that each worker has all the data?
For example, does this code snippet make sure that all user sessions are aggregated correctly, even if the data was split over multiple workers?
Apache Beam's semantics do not depend on which data is processed by which worker. It is up to a particular Beam runner to guarantee that it executes the pipeline correctly. If some runner needs to gather all the data on one worker to do that, then that runner should do so; if the runner can guarantee correctness without doing that, even better.
If so, how exactly does it guarantee this behaviour?
The exact implementation details of efficiently evaluating windowing and triggers are very complex. You can look at the Java implementation in ReduceFnRunner. But at a high level, you can think of it like this (this is just a mental model, not an exact implementation):
The Window transform assigns each element into one or more windows based on its timestamp.
The GroupByKey transform implicitly groups by both key and window, i.e. for every key/window pair it maintains a buffer of elements with this key and window.
If the WindowFn is "merging" (i.e. if the windows of different elements with the same key are not independent, as is the case with session windows), it recomputes the correct set of windows for all elements with this key and remaps each element into the respective new windows (mergeWindows).
How can I replicate this specific situation locally so I can test it?
Use the direct runner, which runs the pipeline in memory locally and provides a reference implementation of the Beam model.
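For example, a minimal sketch of such a local test with the direct runner (the default when no runner is specified), using the timestamps and 10-second session gap from the example above:

import apache_beam as beam
from apache_beam.transforms.window import Sessions, TimestampedValue

with beam.Pipeline() as p:  # DirectRunner is the default local runner
    (p
     | beam.Create([(1, 1), (1, 2), (1, 21), (1, 23), (1, 24)])
     # Attach each element's event timestamp so session windowing can use it.
     | beam.Map(lambda kv: TimestampedValue(kv, kv[1]))
     | beam.WindowInto(Sessions(gap_size=10))
     | beam.GroupByKey()
     | beam.Map(print))

With the data above this should print one group with timestamps [1, 2] and another with [21, 23, 24], matching the desired aggregation.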