I am trying to annotate a queryset with an aggregate over a subset of the preceding rows. Take the following example table of a player's scores by game, where the column last_2_average_score is the rolling average of that player's scores in the previous two games.
+----------+-----------+---------+-------------------------+
| date | player | score | last_2_average_score |
+----------+-----------+---------+-------------------------+
| 12/01/19 | 1 | 10 | None |
| 12/02/19 | 1 | 9 | None |
| 12/03/19 | 1 | 8 | 9.5 |
| 12/04/19 | 1 | 7 | 8.5 |
| 12/05/19 | 1 | 6 | 7.5 |
+----------+-----------+---------+-------------------------+
To accomplish this, I wrote the following query, trying to annotate each row with the average score from that player's previous two games:
from django.db.models import Avg, FloatField, OuterRef, Subquery

ScoreModel.objects.annotate(
    last_two_average_score=Subquery(
        ScoreModel.objects.filter(
            player=OuterRef("player"), date__lt=OuterRef("date")
        )
        .order_by("-date")[:2]
        .annotate(Avg("score"))
        .values("score__avg")[:1],
        output_field=FloatField(),
    )
)
This query, however, does not output the correct result. In fact, every record is simply annotated with:
{'last_two_average_score': None}
I have tried a variety of different combinations of the query, and cannot find the correct combination. Any advice that you can give would be much appreciated!
Instead of trying to address the problem from the ORM first, I ended up circling back and implementing the query in raw SQL first. This immediately led me to the concept of window functions, which I then found very quickly in Django's ORM documentation:
https://docs.djangoproject.com/en/3.0/ref/models/expressions/#window-functions
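For those interested, the resulting query looks something like this, which is much simpler than what I was trying to accomplish with Subquery:
from django.db.models import Avg, F, RowRange, Window

ScoreModel.objects.annotate(
    last_two_average=Window(
        expression=Avg("score"),
        partition_by=[F("player")],
        # Note: with descending dates, the rows "preceding" the current row
        # are later games; order ascending to look back in time instead.
        order_by=[F("date").desc()],
        # Frame: from two rows before the current row (in the window's
        # ordering) up to and including the current row.
        frame=RowRange(start=-2, end=0),
    )
)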
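To make the frame semantics concrete, here is a minimal sketch in plain SQL, run through Python's built-in sqlite3 module rather than Django. The table name scores and the window alias w are made up for illustration, the dates are retyped in ISO format so that they sort chronologically, and it assumes an SQLite build of 3.28 or later (named windows). Ordered by ascending date, the frame ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING covers exactly the two previous games and excludes the current one; the COUNT check returns NULL until two previous games exist, which matches the table in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (date TEXT, player INTEGER, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [
        ("2019-12-01", 1, 10),
        ("2019-12-02", 1, 9),
        ("2019-12-03", 1, 8),
        ("2019-12-04", 1, 7),
        ("2019-12-05", 1, 6),
    ],
)

# The frame looks only at the two rows before the current one, per player.
query = """
SELECT date, player, score,
       CASE WHEN COUNT(*) OVER w = 2 THEN AVG(score) OVER w END
           AS last_2_average_score
FROM scores
WINDOW w AS (
    PARTITION BY player
    ORDER BY date
    ROWS BETWEEN 2 PRECEDING AND 1 PRECEDING
)
ORDER BY player, date
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

In Django the corresponding frame would be RowRange(start=-2, end=-1). As far as I know, a negative end is only accepted from Django 5.1 onward, so treat that as an assumption to check against the version in use.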
Related
I have a database stored in a GridDB container. I need to filter on a column (of codes) and get all the rows whose code contains "INB".
Like this:
'''
id | codes
----+---------------
0 | UIGTF0941H9RBD
1 | UIHGG83G31H9G3
2 | UIFH3442N2B9HD
3 | UI41BBINB2B52O
4 | UI20JUINBHN52N
5 | UI4207HRHIHCBC
Result:
id | codes
----+---------------
3 | UI41BBINB2B52O
4 | UI20JUINBHN52N
'''
I have already checked the SQL documentation on the official GridDB website and saw that the HAVING clause could be used, but I have no idea how to use it.
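For a substring match like this, standard SQL normally uses WHERE with LIKE rather than HAVING, which filters groups after a GROUP BY. The sketch below illustrates the pattern with Python's built-in sqlite3 module as a stand-in, not with a GridDB client. The table name codes_table is hypothetical, and it assumes GridDB's SQL interface accepts the same LIKE syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE codes_table (id INTEGER PRIMARY KEY, codes TEXT)")
conn.executemany(
    "INSERT INTO codes_table VALUES (?, ?)",
    [
        (0, "UIGTF0941H9RBD"),
        (1, "UIHGG83G31H9G3"),
        (2, "UIFH3442N2B9HD"),
        (3, "UI41BBINB2B52O"),
        (4, "UI20JUINBHN52N"),
        (5, "UI4207HRHIHCBC"),
    ],
)

# '%' matches any run of characters, so '%INB%' means "contains INB".
# (SQLite's LIKE is case-insensitive for ASCII by default; other engines may differ.)
matches = conn.execute(
    "SELECT id, codes FROM codes_table WHERE codes LIKE '%INB%' ORDER BY id"
).fetchall()
print(matches)
```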
I have this kind of data set (it originally comes from Excel, so I pasted it here like this). These are ice hockey statistics:
game_id | home_team_shots | home_team_saves | home_team_goals | away_team_shots | away_team_saves | away_team_goals
123 | 20 | 13 | 3 | 15 | 17 | 2
124 | 23 | 28 | 5 | 30 | 17 | 5
-----
There are about 15-20 rows like this, depending on the exact dataset, and I have about 50 of these Excel files.
Now I need to find simple mathematical relationships in this data: basically, whether all values in some column are less than X, equal to X, or more than X. Not every column will contain such a relationship, which is OK.
For example, for the data above it should be identified that away_team_saves == 17 and home_team_shots < 24. Of course, not all columns are important, so I should be able to define which columns to use.
I am mainly looking for a solution in Python. I don't need an exact solution, just a push in the right direction. I am quite a beginner in Python and don't know where I should start looking.
Any help is much appreciated.
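One possible starting point, as a rough sketch rather than a finished solution: for each column of interest, compute the minimum and maximum. If they are equal, the column is constant (== X); otherwise the minimum and maximum give the tightest "at least" and "at most" bounds. The function name summarize_columns and the dict-per-row layout are assumptions made for illustration, and only two of the columns from the sample are included.

```python
def summarize_columns(rows, columns):
    """Return {column: [(relation, value), ...]} for the chosen columns.

    rows is a list of dicts, one per game.
    """
    summary = {}
    for col in columns:
        values = [row[col] for row in rows]
        lo, hi = min(values), max(values)
        if lo == hi:
            # Every value is identical: the column is constant.
            summary[col] = [("==", lo)]
        else:
            # Tightest bounds that hold for every row.
            summary[col] = [(">=", lo), ("<=", hi)]
    return summary


rows = [
    {"game_id": 123, "home_team_shots": 20, "away_team_saves": 17},
    {"game_id": 124, "home_team_shots": 23, "away_team_saves": 17},
]
result = summarize_columns(rows, ["home_team_shots", "away_team_saves"])
print(result)
```

For integer data, "<= 23" is the same statement as "< 24". Every non-constant column trivially has such bounds, so deciding which bounds count as interesting (for example, ones that hold across all 50 files) is a separate step.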
I have a dataframe full of stock transactions. I want to work out whether a given account is ordering less than it usually does.
NB: some customers might place a big order (double the size of normal orders) and then take twice as long to order again. Or they might place orders half the normal size at double the normal frequency.
Data layout:
SA_DACCNT = Account number
SA_TRREF = Invoice No
+-----------+------------+----------+------------+------------+
| SA_DACCNT | SA_TRDATE | SA_TRREF | SA_TRVALUE | SA_PRODUCT |
+-----------+------------+----------+------------+------------+
| GO63 | 01/05/2019 | 11587 | 0.98 | R613S/6 |
| GO63 | 01/05/2019 | 11587 | 0.98 | R614S/6 |
| AAA1 | 01/05/2019 | 11587 | 1.96 | R613S/18 |
| GO63 | 01/05/2019 | 11587 | 2.5 | POST3 |
| DOD2 | 01/07/2019 | 11588 | 7.84 | R613S/18 |
+-----------+------------+----------+------------+------------+
So Far:
I've tried grouping by customer, resampling into quarters, and analysing the z-score of the last quarter for each customer, but it doesn't always work out correctly because not everyone orders quarterly, so there are gaps which skew the figures.
I've also tried fitting a linear model to the figures for each customer and using the coefficient as a metric to determine if spend is down. But it still doesn't look right.
I've been looking at is_monotonic_increasing and other such functions but am still unable to find what I'm looking for. I'm sure there is a statistical technique for it, but it eludes me.
I'm pretty new to all this and am left scratching my head on how best to tackle it.
My goal is to determine WHO has spent less (over their last few orders) than they usually do.
Basically, who do we need to call / chase / send offers to, etc.
Any ideas on the correct way to analyse this would be appreciated. I'm not looking to copy someone's code; I'm sure I can work that bit out myself.
EDIT:
I've also now tried using diff to calculate the average distance between orders and resampling to this value, then calculating the z-score. But it's still not what I expect.
I have a dataframe which contains a whole set of data and relevant id information:
| sample_id | run_id | x | y | z |
| 0 | 1 | 1 | 2 | 3 |
| 0 | 2 | 4 | 5 | 6 |
| 1 | 1 | 1 | 2 | 3 |
| 1 | 2 | 7 | 8 | 9 |
I wish to create a new dataframe based on results from this one. As a simple example, the new dataframe should contain one row per sample with the averages over that sample's runs:
| sample_id | avg_x | avg_y | avg_z |
| 0 | 2.5 | 3.5 | 4.5 |
| 1 | 4 | 5 | 6 |
At the moment I do this with a loop:
pivots = []
for i in samples:
    df_sample = df_samples[df_samples['sample_id'] == i]
    pivot = df_sample.pivot_table(index=index, columns='run_id', values=[x, y, z], aggfunc='mean')
    # Add some other features. Involves creating more columns than existed in the initial df_samples dataframe
    pivots.append(pivot)
# create new dataframe
pd.concat(pivots)
So my first question is: if I want to create a new dataframe which consists of repeated pivots of another dataframe, is there a way to do that all at once with one pivot command instead of having to call it iteratively? If there is, is it more performant?
My second question involves the more complicated case. Is it possible to perform multiple pivots at once to build up the new dataframe when the new dataframe also gains columns, i.e. it might look like:
| s_id | avg_x | avg_y | avg_z | new_feature_1 |new_feature_2 |
| 0 | 2.5 | 3.5 | 4.5 | f(x11, x12, y11, y12, z11, z12) | g(avg_x1, avg_x2, avg_y1, avg_y2, avg_z1, avg_z2) |
| 1 | 4 | 5 | 6 | f(x21, x22, y21, y22, z21, z22) | g(avg_x1, avg_x2, avg_y1, avg_y2, avg_z1, avg_z2) |
The functions essentially perform individual operations on the data per sample_id to create new features.
Aside: I am looking for a good resource on working with large pandas dataframes and performantly constructing new ones or performing queries. I am almost always able to get the result I want using pandas, but my implementations are often inefficient and akin to how it might be done in a lower-level language like C++. I would like to improve my working knowledge, and maybe this involves some theory I do not know about dataframes, tables, etc. A recommendation for a resource would be good. Note that this is just additional helpful information: a recommendation alone does not answer the question, and any answer that addresses my two use cases above will be accepted with or without a recommendation for a resource.
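For the first case, a single groupby over sample_id may be enough to replace the loop. This is a sketch under the assumption that the per-sample result is just the mean over runs, using the sample data from the question. The extra column new_feature and its definition (the range of x within a sample) are invented here purely to show how a further per-sample feature could be attached.

```python
import pandas as pd

df_samples = pd.DataFrame(
    {
        "sample_id": [0, 0, 1, 1],
        "run_id": [1, 2, 1, 2],
        "x": [1, 4, 1, 7],
        "y": [2, 5, 2, 8],
        "z": [3, 6, 3, 9],
    }
)

grouped = df_samples.groupby("sample_id")

# One pass over all samples: mean of x, y, z per sample_id.
result = grouped[["x", "y", "z"]].mean().add_prefix("avg_")

# Hypothetical extra feature computed per sample: range of x across runs.
result["new_feature"] = grouped["x"].max() - grouped["x"].min()

# Turn sample_id back into an ordinary column.
result = result.reset_index()
print(result)
```

A single grouped aggregation avoids the Python-level loop and the final concat, though whether it is actually faster on a given dataset is worth measuring.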
I know a MySQL table is not in any particular order, but I need it to behave in an ordered way. I have a table, example below, which currently seems to be in the order of column 1 (jobID, auto increment, primary key). What I need to do is change the order by, for example, moving the 3rd row up one position, essentially swapping the positions of the 3rd and 2nd rows, but I am unsure how to do this. The reason is that I am accessing this database via a Python app which grabs jobs from a list, and I sometimes need to change the priority order. What would be the best way to do this, please?
+-------+---------+----------+---------+--------+---------------------+
| jobID | location| mode | process | status | submitTime |
+-------+---------+----------+---------+--------+---------------------+
| 1 | /let111/| Verify | 1 | Failed | 2014-06-25 12:24:38 |
| 2 | /let114/| Verify | 1 | Passed | 2014-06-25 12:37:31 |
| 3 | /let112/| Verify | 1 | Failed | 2014-06-25 14:48:55 |
| 4 | /let117/| Verify | 2 | Passed | 2014-06-25 14:49:01 |
| 5 | /let113/| Verify | 2 | Passed | 2014-06-25 14:49:13 |
+-------+---------+----------+---------+--------+---------------------+
If you want to order by status first and then time, use an order by clause:
select t.*
from table t
order by status, submitTime;
Tables in relational databases in general, and in MySQL in particular, have no inherent order, because a table models an unordered set of tuples. You can get results back in a particular order by using order by.
Note that the above is just one possibility that accomplishes what you are asking for. What you really want might involve other columns.
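If the goal is to reprioritise jobs by hand, one common approach is to store the order explicitly rather than rely on jobID. The sketch below is an assumption, not part of the schema above: it adds a hypothetical priority column, swaps the values for jobs 2 and 3, and reads the queue with ORDER BY priority. It uses Python's built-in sqlite3 module so that it runs standalone; with MySQL the same idea should carry over through a MySQL driver, although the placeholder syntax differs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (jobID INTEGER PRIMARY KEY, location TEXT, priority INTEGER)"
)
conn.executemany(
    "INSERT INTO jobs VALUES (?, ?, ?)",
    [
        (1, "/let111/", 1),
        (2, "/let114/", 2),
        (3, "/let112/", 3),
        (4, "/let117/", 4),
        (5, "/let113/", 5),
    ],
)


def swap_priority(conn, job_a, job_b):
    """Exchange the priority values of two jobs in one transaction."""
    with conn:  # commits on success, rolls back on error
        (pa,) = conn.execute(
            "SELECT priority FROM jobs WHERE jobID = ?", (job_a,)
        ).fetchone()
        (pb,) = conn.execute(
            "SELECT priority FROM jobs WHERE jobID = ?", (job_b,)
        ).fetchone()
        conn.execute("UPDATE jobs SET priority = ? WHERE jobID = ?", (pb, job_a))
        conn.execute("UPDATE jobs SET priority = ? WHERE jobID = ?", (pa, job_b))


swap_priority(conn, 2, 3)

# The app always reads jobs in priority order, never in storage order.
order = [row[0] for row in conn.execute("SELECT jobID FROM jobs ORDER BY priority")]
print(order)
```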