Are sales values decreasing over time? Pandas - Python

I have a dataframe full of stock transactions. I want to work out how to tell whether a given account is ordering less than it usually does.
NB: some customers might place a big order (double their normal size) and then take twice as long to order again, or order half their normal size at double their normal frequency.
Data layout:
SA_DACCNT = Account number
SA_TRREF = Invoice No
+-----------+------------+----------+------------+------------+
| SA_DACCNT | SA_TRDATE  | SA_TRREF | SA_TRVALUE | SA_PRODUCT |
+-----------+------------+----------+------------+------------+
| GO63      | 01/05/2019 | 11587    | 0.98       | R613S/6    |
| GO63      | 01/05/2019 | 11587    | 0.98       | R614S/6    |
| AAA1      | 01/05/2019 | 11587    | 1.96       | R613S/18   |
| GO63      | 01/05/2019 | 11587    | 2.5        | POST3      |
| DOD2      | 01/07/2019 | 11588    | 7.84       | R613S/18   |
+-----------+------------+----------+------------+------------+
So Far:
I've tried grouping by customer, resampling into quarters, and analysing the z-score of the last quarter for each customer, but it doesn't always work out correctly because not everyone orders quarterly, so there are gaps which skew the figures.
I've also tried fitting a linear model to the figures for each customer and using the coefficient as a metric to determine whether spend is down, but it still doesn't look right.
I've been looking at is_monotonic_increasing and other such functions, but I'm still unable to find what I'm looking for. I'm sure there is a statistical technique for this, but it eludes me.
I'm pretty new to all this and am left scratching my head over how best to tackle it.
My goal is to determine WHO has spent less (over their last few orders) than they usually do.
Basically: who do we need to call / chase / send offers to, etc.?
Any ideas on the correct way to analyse this would be appreciated. I'm not looking to copy someone's code; I'm sure I can work that bit out myself.
EDIT:
I've also now tried using diff() to calculate the average distance between orders, resampling to that interval, and then computing the z-score. But it's still not what I expect.
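To illustrate the kind of comparison I'm after, here's a rough sketch of my current thinking: compare each customer's recent spend rate (value per day, which absorbs differences in order size and order frequency) against their own history. Column names are as in the table above; the file name and the 90-day window are arbitrary assumptions.
import pandas as pd

df = pd.read_csv("sales.csv", parse_dates=["SA_TRDATE"], dayfirst=True)

# Total value per customer per order date
daily = (df.groupby(["SA_DACCNT", "SA_TRDATE"])["SA_TRVALUE"]
           .sum()
           .reset_index())

def spend_trend(g, recent_days=90):
    """Ratio of a customer's recent spend rate to their historical rate."""
    g = g.sort_values("SA_TRDATE")
    cutoff = g["SA_TRDATE"].max() - pd.Timedelta(days=recent_days)
    hist = g[g["SA_TRDATE"] <= cutoff]
    recent = g[g["SA_TRDATE"] > cutoff]
    if hist.empty or recent.empty:
        return pd.NA  # not enough history to judge
    # Spend per day absorbs both order size and order frequency
    hist_days = max((cutoff - g["SA_TRDATE"].min()).days, 1)
    hist_rate = hist["SA_TRVALUE"].sum() / hist_days
    recent_rate = recent["SA_TRVALUE"].sum() / recent_days
    return recent_rate / hist_rate  # < 1 means spending below their norm

ratios = daily.groupby("SA_DACCNT").apply(spend_trend).dropna().sort_values()
print(ratios.head(10))  # the ten accounts furthest below their usual spend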

Related

How to implement Random Forest with a lot of categorical columns?

I want to implement a random forest to predict which customers will be 'goners' (churn) and which will be 'stayers'.
I have data for customers who have already left and customers who are still there (it's imagined data, not real customer data).
I have a couple of columns that look like this:
CustomerNR | Email (Y/N) | Age   | Water Usage (in l) | How did we Contact them?
1          | Yes         | 20-30 | 1000l              | Mail
2          | No          | 50-70 | 500l               | Telephone
3          | Yes         | 40-50 | 1099l              | NAN
How would I start with this?
I'm really inexperienced, and I don't find the tutorials online helpful because they always deal with purely numeric data, like weather prediction.
I have 200k "customers" in my dataset and wanted to know if there is a good tutorial for this, or at least some directions on where I could go.
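A minimal sketch of the usual starting point would be to one-hot encode the categorical columns and feed the result to scikit-learn's RandomForestClassifier. The file name and the 'churned' label column are assumptions here; the feature names come from the table above.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Assumed layout: the columns above plus a 'churned' label (1 = goner, 0 = stayer)
df = pd.read_csv("customers.csv")

# Strip the unit so water usage becomes numeric
df["water_usage"] = df["Water Usage (in l)"].str.rstrip("l").astype(float)

# One-hot encode the categorical columns; a NaN cell simply gets all-zero dummies
X = pd.get_dummies(df[["Email (Y/N)", "Age", "How did we Contact them?"]]).join(df["water_usage"])
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
Once encoded, the forest handles binned columns like Age the same as any other feature, and 200k rows is a very manageable size for it.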

Pandas Regression by group of two columns

What I'm going to do
I'd like to get the average stock price, regression coefficient, and R-squared of stock prices (as floats) by stock item (e.g. Apple, Amazon, etc.) and by date period (e.g. Feb. 15 ~ Mar. 14), as part of a quantitative investment simulation spanning 30 years. The problem is that it is simply too slow. At first I built the whole thing in PostgreSQL, but it was too slow - it hadn't finished after 2 hours. After asking a professor friend in management information systems, I'm trying pandas for the first time.
The data structure implemented so far looks like this:
Raw data (Dataframe named dfStock)
──────────────────────────────────────────
Code | Date     | Date Group | Price  |
──────────────────────────────────────────
AAPL | 20200205 | 20200205   | ###.## |
AAPL | 20200206 | 20200305   | ###.## |
...
AAPL | 20200305 | 20200305   | ###.## |
AAPL | 20200306 | 20200405   | ###.## |
...
──────────────────────────────────────────
Results (Dataframe named dfSumS)
──────────────────────────────────────────
Code | Date group | Avg. Price | Slope | R-Square
──────────────────────────────────────────
AAPL | 20200205   | ###.##     | #.##  | #.##
AMZN | 20200205   | ###.##     | #.##  | #.##
...
AAPL | 20200305   | ###.##     | #.##  | #.##
AMZN | 20200305   | ###.##     | #.##  | #.##
...
──────────────────────────────────────────
Code as of now
'prevdt' corresponds to 'Date Group' above, and 'compcd' means company code.
from sklearn.linear_model import LinearRegression
from scipy import stats

# Method tried 1
model = LinearRegression()

def getRegrS(arg_cd, arg_prevdt):
    mask = (dfStock['compcd'] == arg_cd) & (dfStock['prevdt'] == arg_prevdt)
    x = dfStock[mask]['rnk'].to_numpy().reshape((-1, 1))
    y = dfStock[mask]['adjenp'].to_numpy()
    model.fit(x, y)
    return model.coef_[0], model.score(x, y)  # slope, R-squared

# Method tried 2
def getRegrS(arg_cd, arg_prevdt):
    mask = (dfStock['compcd'] == arg_cd) & (dfStock['prevdt'] == arg_prevdt)
    x = dfStock[mask]['rnk'].to_numpy()
    y = dfStock[mask]['adjenp'].to_numpy()
    rv = stats.linregress(x, y)
    return rv.slope, rv.rvalue ** 2  # linregress returns r, so square it for R-squared

# Rank within each (compcd, prevdt) group serves as the regression's x axis
dfStock['rnk'] = dfStock.groupby(['compcd', 'prevdt']).cumcount() + 1
dfSumS[['slope', 'rsq']] = [getRegrS(cd, prevdt)
                            for cd, prevdt in zip(dfSumS['compcd'], dfSumS['prevdt'])]
What I've tried before
Based on the recommendation in this link, I tried vectorization, but got the message "Can only compare identically-labeled Series objects". Unable to solve this problem, I came to the two functions above, which were not fast enough. Both worked on a smaller set of data, like the year 2020 alone, but once the data period grew to 2~3 decades, it took hours.
I thought of apply, iterrows, etc., but didn't use them because, firstly, the link says they're slower than what I've done, and secondly, each of them seems to apply to only one column while I need two results - the coefficient and R-squared over the same period - so calling them twice would definitely be slower.
Now I'm trying the multiprocessing Pool approach mentioned in a few posts.
I'm afraid that if you're trying to run thousands of large linear regressions, then you will have to pay the price in running time. If you are only interested in the beta coefficient or the r² value, it could be more efficient to calculate them separately with numpy, as (X^T X)^(-1) X^T y and cov(X,y)/sqrt(var(X)var(y)) respectively (the latter is the correlation r; square it for r²).
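A rough sketch of that separate calculation, vectorized over groups with pandas named aggregation (column names taken from the question: compcd, prevdt, rnk, adjenp):
import pandas as pd

# Per-group sums are all the closed forms need:
# slope = cov(x, y) / var(x),  r^2 = cov(x, y)^2 / (var(x) * var(y))
d = dfStock[['compcd', 'prevdt', 'rnk', 'adjenp']].copy()
d['xy'] = d['rnk'] * d['adjenp']
d['xx'] = d['rnk'] ** 2
d['yy'] = d['adjenp'] ** 2

s = d.groupby(['compcd', 'prevdt']).agg(
    n=('rnk', 'size'), sx=('rnk', 'sum'), sy=('adjenp', 'sum'),
    sxy=('xy', 'sum'), sxx=('xx', 'sum'), syy=('yy', 'sum'),
)

cov = s['sxy'] - s['sx'] * s['sy'] / s['n']   # n times the covariance of x and y
varx = s['sxx'] - s['sx'] ** 2 / s['n']       # n times the variance of x
vary = s['syy'] - s['sy'] ** 2 / s['n']       # n times the variance of y

dfSumS = pd.DataFrame({
    'avg_price': s['sy'] / s['n'],
    'slope': cov / varx,
    'rsq': cov ** 2 / (varx * vary),
}).reset_index()
This avoids both the per-group boolean filtering and the Python-level loop over groups, which is where the hours were going.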

Finding matching mathematical formulas in dataset

I have this kind of dataset (it originally comes from Excel, so I paste it here like this). These are ice hockey statistics:
game_id | home_team_shots | home_team_saves | home_team_goals | away_team_shots | away_team_saves | away_team_goals
123     | 20              | 13              | 3               | 15              | 17              | 2
124     | 23              | 28              | 5               | 30              | 17              | 5
-----
There are about 15-20 rows like this depending on the exact dataset, and I have about 50 of these Excel files.
Now I need to find simple mathematical formulas in this data: basically, whether in some column all values are less than X, equal to X, or greater than X. Not all columns will contain such a formula, which is OK.
For example, for the data above it should be identified that away_team_saves == 17 and home_team_shots < 24. Of course not all columns are important, so I should be able to define which columns to use.
I am mainly looking for a solution in Python. I don't need an exact solution, just a push in the right direction. I am quite the beginner in Python and don't know where I should start looking.
All help is very much appreciated.
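One way to read "simple formulas" is as per-column invariants: a column either takes a single value (== X) or stays within a bound (< X, > X). A minimal sketch along those lines (find_rules and the file name are illustrative, not an existing API):
import pandas as pd

def find_rules(df, columns):
    """Report a simple invariant per column: '== X' if the column is
    constant, otherwise the tightest integer bounds '< X' and '> X'."""
    rules = {}
    for col in columns:
        values = df[col].dropna()
        if values.nunique() == 1:
            rules[col] = f"{col} == {values.iloc[0]}"
        else:
            rules[col] = f"{col} < {values.max() + 1}, {col} > {values.min() - 1}"
    return rules

df = pd.read_excel("games.xlsx")
print(find_rules(df, ["home_team_shots", "away_team_saves"]))
# With just the two sample rows above this would print:
# {'home_team_shots': 'home_team_shots < 24, home_team_shots > 19',
#  'away_team_saves': 'away_team_saves == 17'}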

Django Subquery Subset of Previous Records

I am trying to annotate a queryset with an aggregate over a subset of previous rows. Take the following example table of a player's score per game, where the column last_2_average_score is the rolling average of that player's score over their previous two games.
+----------+--------+-------+----------------------+
| date     | player | score | last_2_average_score |
+----------+--------+-------+----------------------+
| 12/01/19 | 1      | 10    | None                 |
| 12/02/19 | 1      | 9     | None                 |
| 12/03/19 | 1      | 8     | 9.5                  |
| 12/04/19 | 1      | 7     | 8.5                  |
| 12/05/19 | 1      | 6     | 7.5                  |
+----------+--------+-------+----------------------+
In order to accomplish this, I wrote the following query, trying to annotate each "row" with the corresponding two-game average of that player's score:
from django.db.models import Avg, FloatField, OuterRef, Subquery

ScoreModel.objects.annotate(
    last_two_average_score=Subquery(
        ScoreModel.objects.filter(
            player=OuterRef("player"), date__lt=OuterRef("date")
        )
        .order_by("-date")[:2]
        .annotate(Avg("score"))
        .values("score__avg")[:1],
        output_field=FloatField(),
    )
)
This query, however, does not output the correct result. In fact, every record is just annotated with
{'last_two_average_score': None}
I have tried a variety of different combinations of the query and cannot find the correct one. Any advice you can give would be much appreciated!
Instead of trying to address the problem from the ORM first, I ended up circling back and implementing the query in raw SQL first. This immediately led me to the concept of window functions, which I then quickly found in Django's ORM.
https://docs.djangoproject.com/en/3.0/ref/models/expressions/#window-functions
For those interested, the resulting query looks something like this, which was much simpler than what I was trying to accomplish with Subquery:
from django.db.models import Avg, F, RowRange, Window

ScoreModel.objects.annotate(
    last_two_average=Window(
        expression=Avg("score"),
        partition_by=[F("player")],  # rolling average per player
        order_by=[F("date").desc()],
        frame=RowRange(start=-2, end=0),  # spans two rows before the current row through the current row
    )
)

How to represent the trend (upward/downward/no change) of data?

I have a dataset where each row represents the number of occurrences of a certain behaviour. The columns represent windows of a set amount of time. It looks like this:
+----------+----------+----------+----------+----------+-----+
| Episode1 | Episode2 | Episode3 | Episode4 | Episode5 | ... |
+----------+----------+----------+----------+----------+-----+
| 2        | 0        | 1        | 3        |          |     |
| 1        | 2        | 4        | 2        | 3        |     |
| 0        |          |          |          |          |     |
+----------+----------+----------+----------+----------+-----+
There are over 150 episodes. I want to find a way to represent each row as a trend, i.e. whether the occurrences are increasing or decreasing.
I have tried first calculating the average/median/sum of every 3/5/10 cells of each row (because the rows have different lengths and many 0 values), and then correlating these against time; the sign of the correlation coefficient gives the trend (< 0 means downward, > 0 upward). The trends will be used in further analysis.
I'm wondering if there is a better way to do this. Thanks.
If you expect the trend to be linear, you could fit a linear regression to each row separately, using time to predict the number of occurrences of the behavior, and then store the slopes.
The slope represents the effect of increasing time by one episode on the behavior. It also naturally accounts for the differing lengths of the time series.
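A minimal sketch of that per-row fit, assuming the data sits in a DataFrame df shaped like the table above, with evenly spaced episodes:
import numpy as np
import pandas as pd

def row_slope(row):
    """Least-squares slope of occurrences vs. episode index,
    skipping the empty cells of shorter rows."""
    y = row.dropna().to_numpy(dtype=float)
    if len(y) < 2:
        return np.nan  # a single observation has no trend
    x = np.arange(len(y))
    return np.polyfit(x, y, 1)[0]  # degree-1 fit; index 0 is the slope

slopes = df.apply(row_slope, axis=1)
trend = np.sign(slopes)  # +1 upward, -1 downward, 0 flat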
