Finding matching mathematical formulas in dataset - python

I have a data set like this (it originally comes from Excel, so I pasted it here like this). These are ice hockey statistics:
game_id | home_team_shots | home_team_saves | home_team_goals | away_team_shots | away_team_saves | away_team_goals
123 | 20 | 13 | 3 | 15 | 17 | 2
124 | 23 | 28 | 5 | 30 | 17 | 5
-----
There are about 15-20 rows like this depending on the exact dataset, and I have about 50 of these Excel files.
Now I need to find simple mathematical formulas in this data: basically, whether all values in a column are less than X, equal to X, or greater than X. Not every column will contain such a formula, which is OK.
For example, for the data above it should identify that away_team_saves == 17 and home_team_shots < 24. Of course, not all columns are important, so I should be able to define which columns to use.
I am mainly looking for a solution in Python. I don't need an exact solution, just a push in the right direction. I am quite a beginner in Python and don't know where I should start looking.
All help is much appreciated.
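One possible starting point, as a minimal sketch: load each sheet into a pandas dataframe and compare each chosen column's minimum and maximum. The function name describe_columns and the output format are made up for illustration; a constant column yields an "==" rule and any other column yields its observed bounds.

```python
import pandas as pd

def describe_columns(df, columns):
    """Return one simple rule per selected column, based on its min and max."""
    rules = {}
    for col in columns:
        lo, hi = df[col].min(), df[col].max()
        if lo == hi:
            # Every value in the column is identical.
            rules[col] = f"{col} == {lo}"
        else:
            # Otherwise report the tightest bounds seen in the data.
            rules[col] = f"{lo} <= {col} <= {hi}"
    return rules

# The two sample rows from the question (only some columns shown).
df = pd.DataFrame({
    "game_id": [123, 124],
    "home_team_shots": [20, 23],
    "away_team_saves": [17, 17],
})
rules = describe_columns(df, ["home_team_shots", "away_team_saves"])
for rule in rules.values():
    print(rule)
```

For real files, pd.read_excel can load each workbook into a dataframe, so the same function could be called in a loop over the 50 files.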

Related

Check to see if a dataframe increases to a corner

I have numerous dataframes that look something like this:
| first_column | second_column | third_column |
| ------------ | ------------- | ------------ |
| 25 | 20 | 18 |
| 20 | 21 | 18 |
| 17 | 18 | 16 |
I want to see how well they converge to a corner (i.e., how consistently the values increase as you move toward a corner of the dataframe). I've tried a couple of different methods. One was using df.diff() < 0 to see if the values are decreasing as you go down each column, converting the True/False results to 1 or 0 and summing them, then doing the same with the dataframe transposed and adding both sums together. That doesn't seem to pick out the dataframes that best converge to a corner.
I know this is oddly specific, but is there any way to measure how well the values of a dataframe increase as you go toward a corner?
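One way to make the diff-based idea comparable across corners is to score each corner separately and normalise by the number of steps. The following is only a sketch under that assumption; corner_score and the corner names are hypothetical, and ties count as "not increasing".

```python
import numpy as np
import pandas as pd

def corner_score(df, corner="top_left"):
    """Fraction of adjacent steps (down columns and along rows) that increase toward `corner`."""
    a = df.to_numpy(dtype=float)
    # Flip the array so the chosen corner always ends up at the top-left.
    if "bottom" in corner:
        a = a[::-1, :]
    if "right" in corner:
        a = a[:, ::-1]
    # Moving away from the top-left corner, values should decrease (diff < 0).
    down = np.diff(a, axis=0) < 0
    across = np.diff(a, axis=1) < 0
    return (down.sum() + across.sum()) / (down.size + across.size)

df = pd.DataFrame({
    "first_column": [25, 20, 17],
    "second_column": [20, 21, 18],
    "third_column": [18, 18, 16],
})
corners = ("top_left", "top_right", "bottom_left", "bottom_right")
scores = {c: corner_score(df, c) for c in corners}
best = max(scores, key=scores.get)
print(best, scores[best])
```

A score of 1.0 means every step toward the corner is an increase, so dataframes can be ranked by their best corner score.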

Python dataframes - Finding the highest value from 2 or more different rows

I'm trying to figure out how to get the max() from 2 different rows in the same column.
For instance, in my dataframe I'm trying to find the highest score from A and B, or from A and C. How would I go about doing that? I would also like to learn how to find the max() from a range of rows (A to E), if someone could provide some insight.
Date | Class | High | Low
2/1/2021 | A | 10 | 4
2/1/2021 | B | 23 | 7
2/1/2021 | C | 11 | 8
2/1/2021 | D | 14 | 1
2/1/2021 | E | 12 | 11
Cheers!
To find the position of the max value in a column you can use argmax()
>>> df['High'].argmax()
1
or, to get the max over a range of rows:
>>> df.High.values[0:4].max()  # [0:4] selects rows 0 to 3 (the end index is exclusive)
23
or you can use a simple if/else to compare two rows, for example:
if df.High.values[0] > df.High.values[1]:
    print(df.High.values[0])
else:
    print(df.High.values[1])
Actually, I went a different way; here is my solution for anyone who might need it.
a = df.loc[(df['Class'] == 'A') | (df['Class'] == 'B')]
abhigh = a.groupby(a['Date'], as_index=False)['High'].max()
I used .loc to create a dataframe of all rows with class A or B, then used groupby to find the high score for each day.
I didn't clarify this in my question, but my dataframe has the class scores for every day, so I needed a way to calculate the highest value for each day.
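The same selection can also be written with isin, which scales more easily to longer class lists. This is a sketch using the sample data from the question; the label slice "A":"E" assumes one row per class, so with several days in the dataframe it would need to be applied per date.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2/1/2021"] * 5,
    "Class": ["A", "B", "C", "D", "E"],
    "High": [10, 23, 11, 14, 12],
    "Low": [4, 7, 8, 1, 11],
})

# Highest score among selected classes, per day.
ab_high = df[df["Class"].isin(["A", "B"])].groupby("Date", as_index=False)["High"].max()

# Highest score among A and C, ignoring the date.
ac_high = df.loc[df["Class"].isin(["A", "C"]), "High"].max()

# Highest score over a range of classes; label slicing with .loc includes both ends.
a_to_e_high = df.set_index("Class").loc["A":"E", "High"].max()
```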

Django Subquery Subset of Previous Records

I am trying to annotate a queryset with an aggregate of a subset of previous rows. Take the following example table of a player's score in a particular game, where the column last_2_average_score is the rolling average of that player's scores from the previous two games.
+----------+-----------+---------+-------------------------+
| date | player | score | last_2_average_score |
+----------+-----------+---------+-------------------------+
| 12/01/19 | 1 | 10 | None |
| 12/02/19 | 1 | 9 | None |
| 12/03/19 | 1 | 8 | 9.5 |
| 12/04/19 | 1 | 7 | 8.5 |
| 12/05/19 | 1 | 6 | 7.5 |
+----------+-----------+---------+-------------------------+
In order to accomplish this, I wrote the following query, trying to annotate each "row" with the corresponding two-game average score:
ScoreModel.objects.annotate(
    last_two_average_score=Subquery(
        ScoreModel.objects.filter(
            player=OuterRef("player"), date__lt=OuterRef("date")
        )
        .order_by("-date")[:2]
        .annotate(Avg("score"))
        .values("score__avg")[:1],
        output_field=FloatField(),
    )
)
This query, however, does not output the correct result. In fact, every record is just annotated with
{'last_two_average_score': None}
I have tried a variety of different combinations of the query, and cannot find the correct combination. Any advice that you can give would be much appreciated!
Instead of trying to address the problem from the ORM first, I ended up circling back and first implementing the query in raw SQL. This immediately led me to the concept of window functions, which I then found very quickly in Django's ORM.
https://docs.djangoproject.com/en/3.0/ref/models/expressions/#window-functions
For those interested, the resulting query looks something like this, which is much simpler than what I was trying to accomplish with Subquery:
from django.db.models import Avg, F, RowRange, Window

ScoreModel.objects.annotate(
    last_two_average=Window(
        expression=Avg("score"),
        partition_by=[F("player")],
        order_by=[F("date").desc()],
        frame=RowRange(start=-2, end=0),
    )
)
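As a sanity check outside the ORM, the last_2_average_score column from the example table can be reproduced in pandas. This is only a sketch for comparing results, not part of the Django query; it assumes the rows have been loaded into a dataframe with the same column names.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(
        ["12/01/19", "12/02/19", "12/03/19", "12/04/19", "12/05/19"],
        format="%m/%d/%y",
    ),
    "player": [1, 1, 1, 1, 1],
    "score": [10, 9, 8, 7, 6],
})

df = df.sort_values(["player", "date"])
# shift(1) drops the current game, rolling(2) averages the two games before it.
df["last_2_average_score"] = df.groupby("player")["score"].transform(
    lambda s: s.shift(1).rolling(2).mean()
)
expected = df["last_2_average_score"].tolist()
```

The first two values are NaN (shown as None in the table) because fewer than two earlier games exist for those rows.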

Are sales values decreasing over time? Pandas

I have a dataframe full of stock transactions. I want to work out whether a certain account is ordering less than it usually does.
NB: some customers might place a big order (double the size of their normal orders) and then take twice as long to order again, or place orders half the size at double the normal frequency.
Data layout:
SA_DACCNT = Account number
SA_TRREF = Invoice No
+-----------+------------+----------+------------+------------+
| SA_DACCNT | SA_TRDATE | SA_TRREF | SA_TRVALUE | SA_PRODUCT |
+-----------+------------+----------+------------+------------+
| GO63 | 01/05/2019 | 11587 | 0.98 | R613S/6 |
| GO63 | 01/05/2019 | 11587 | 0.98 | R614S/6 |
| AAA1 | 01/05/2019 | 11587 | 1.96 | R613S/18 |
| GO63 | 01/05/2019 | 11587 | 2.5 | POST3 |
| DOD2 | 01/07/2019 | 11588 | 7.84 | R613S/18 |
+-----------+------------+----------+------------+------------+
So Far:
I've tried grouping by customer, resampling into quarters, and analysing the z-score of the last quarter for each customer, but it doesn't always work out correctly because not everyone orders quarterly, so there are gaps which skew the figures.
I've also tried fitting a linear model to the figures for each customer and using the coefficient as a metric to determine if spend is down, but it still doesn't look right.
I've been looking at is_monotonic_increasing and other such functions but am still unable to find what I'm looking for. I'm sure there is a statistical technique for it, but it eludes me.
I'm pretty new to all this and am left scratching my head on how best to tackle it.
My goal is to determine WHO has spent less (over their last few orders) than they usually do.
Basically, who do we need to call / chase / send offers to, etc.
Any ideas on the correct way to analyse this would be appreciated. I'm not looking to copy someone's code; I'm sure I can work that bit out myself.
EDIT:
I've also now tried using diff to calculate the average distance between orders, resampling to this value, and then calculating the z-score. But it's still not what I expect.
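One simple option that is insensitive to the order-size versus order-frequency trade-off is to compare spend per day in a recent window with spend per day over the earlier history. The sketch below illustrates that idea only; it is not a statistical test. The function name spend_ratio, the 90-day window, and the sample figures are all made up, and SA_TRDATE is assumed to be parsed to datetimes already.

```python
import pandas as pd

def spend_ratio(df, as_of, recent_days=90):
    """Per account: recent daily spend divided by earlier daily spend (below 1 suggests a drop)."""
    as_of = pd.Timestamp(as_of)
    cutoff = as_of - pd.Timedelta(days=recent_days)
    out = {}
    for account, g in df.groupby("SA_DACCNT"):
        earlier_days = (cutoff - g["SA_TRDATE"].min()).days
        if earlier_days <= 0:
            continue  # not enough history before the cutoff to compare against
        earlier = g.loc[g["SA_TRDATE"] < cutoff, "SA_TRVALUE"].sum() / earlier_days
        recent = g.loc[g["SA_TRDATE"] >= cutoff, "SA_TRVALUE"].sum() / recent_days
        if earlier > 0:
            out[account] = recent / earlier
    return pd.Series(out, dtype=float).sort_values()

# Made-up transactions: AAA1 orders steadily, GO63 drops off after May.
df = pd.DataFrame({
    "SA_DACCNT": ["AAA1"] * 5 + ["GO63"] * 4,
    "SA_TRDATE": pd.to_datetime([
        "2019-01-01", "2019-03-01", "2019-05-01", "2019-07-01", "2019-09-01",
        "2019-01-01", "2019-03-01", "2019-05-01", "2019-08-01",
    ]),
    "SA_TRVALUE": [100, 100, 100, 100, 100, 100, 100, 100, 20],
})
ratios = spend_ratio(df, as_of="2019-09-28")
print(ratios)
```

Accounts at the top of the sorted result are the candidates to call; the window length and any cut-off ratio would need tuning against real ordering patterns.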

Performant alternative to constructing a dataframe by applying repeated pivots

I have a dataframe which contains a whole set of data and relevant id information:
| sample_id | run_id | x | y | z |
| 0 | 1 | 1 | 2 | 3 |
| 0 | 2 | 4 | 5 | 6 |
| 1 | 1 | 1 | 2 | 3 |
| 1 | 2 | 7 | 8 | 9 |
I wish to create a new dataframe based on results from this one. As a simple example, the new dataframe should contain one row per sample with the averages over that sample's runs:
| sample_id | avg_x | avg_y | avg_z |
| 0 | 2.5 | 3.5 | 4.5 |
| 1 | 4 | 5 | 6 |
At the moment I do this with a loop:
pivots = []
for i in samples:
    df_sample = df_samples[df_samples['sample_id'] == i]
    pivot = df_sample.pivot_table(index=index, columns='run_id', values=[x, y, z], aggfunc='mean')
    # Add some other features. Involves creating more columns than existed in the initial df_samples dataframe
    pivots.append(pivot)
# create new dataframe
pd.concat(pivots)
So my first question is: if I want to create a new dataframe which consists of repeated pivots of another dataframe, is there a way to do that all at once with one pivot command instead of calling it iteratively? If there is, is it more performant?
My second question involves the more complicated case: is it possible to perform multiple pivots at once to build up the new dataframe when the new dataframe also gains extra columns? For example, it might look like
| s_id | avg_x | avg_y | avg_z | new_feature_1 |new_feature_2 |
| 0 | 2.5 | 3.5 | 4.5 | f(x11, x12, y11, y12, z11, z12) | g(avg_x1, avg_x2, avg_y1, avg_y2, avg_z1, avg_z2) |
| 1 | 4 | 5 | 6 | f(x21, x22, y21, y22, z21, z22) | g(avg_x1, avg_x2, avg_y1, avg_y2, avg_z1, avg_z2) |
The functions essentially perform individual operations on the data per sample_id to create new features.
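For the sample data above, both cases can be sketched with a single grouped aggregation instead of a loop. The x_range column is a hypothetical stand-in for a per-sample feature function such as f; the real feature functions are not given in the question.

```python
import pandas as pd

df_samples = pd.DataFrame({
    "sample_id": [0, 0, 1, 1],
    "run_id": [1, 2, 1, 2],
    "x": [1, 4, 1, 7],
    "y": [2, 5, 2, 8],
    "z": [3, 6, 3, 9],
})

# One groupby computes the averages and an extra per-sample feature together.
result = df_samples.groupby("sample_id").agg(
    avg_x=("x", "mean"),
    avg_y=("y", "mean"),
    avg_z=("z", "mean"),
    x_range=("x", lambda s: s.max() - s.min()),  # hypothetical new feature
).reset_index()

# A single pivot over the whole frame gives per-run values for every sample at once.
wide = df_samples.pivot_table(
    index="sample_id", columns="run_id", values=["x", "y", "z"], aggfunc="mean"
)
```

Features that need values from several runs at once can then be computed column-wise on wide, for example from wide[("x", 1)] and wide[("x", 2)].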
Aside: I am looking for a good resource on working with large pandas dataframes and on performantly constructing new ones or performing queries. I am almost always able to get the result I want using pandas, but my implementations are often inefficient and resemble how it might be done in a lower-level language like C++. I would like to improve my working knowledge, and maybe this involves some theory on dataframes and tables that I do not know. A recommendation for a resource would be good. Note that this is just additional helpful information: a recommendation alone does not answer the question, and any answer that addresses my two use cases above will be accepted with or without a recommendation for a resource.
