Pandas Regression by group of two columns - python

What I'm going to do
I'd like to compute the average stock price, the regression slope, and the R-squared of stock prices (floats) for each stock item (e.g. Apple, Amazon) over a given date window (e.g. Feb. 15 ~ Mar. 14), as part of a quantitative investment simulation spanning 30 years. The problem is that it is simply too slow. At first I wrote the whole thing in PostgreSQL, but it was too slow - it hadn't finished after 2 hours. After asking a professor friend in management information systems, I'm trying pandas for the first time.
The data structures implemented so far look like this:
Raw data (DataFrame named dfStock)
+------+----------+------------+--------+
| Code | Date     | Date Group | Price  |
+------+----------+------------+--------+
| AAPL | 20200205 | 20200205   | ###.## |
| AAPL | 20200206 | 20200305   | ###.## |
| ...  | ...      | ...        | ...    |
| AAPL | 20200305 | 20200305   | ###.## |
| AAPL | 20200306 | 20200405   | ###.## |
| ...  | ...      | ...        | ...    |
+------+----------+------------+--------+
Results (DataFrame named dfSumS)
+------+------------+------------+-------+----------+
| Code | Date Group | Avg. Price | Slope | R-Square |
+------+------------+------------+-------+----------+
| AAPL | 20200205   | ###.##     | #.##  | #.##     |
| AMZN | 20200205   | ###.##     | #.##  | #.##     |
| ...  | ...        | ...        | ...   | ...      |
| AAPL | 20200305   | ###.##     | #.##  | #.##     |
| AMZN | 20200305   | ###.##     | #.##  | #.##     |
| ...  | ...        | ...        | ...   | ...      |
+------+------------+------------+-------+----------+
Code As of Now
'prevdt' corresponds to 'Date Group' above, and 'compcd' is the company code.
from scipy import stats
from sklearn.linear_model import LinearRegression

# Method tried 1: one sklearn fit per (company, date group) pair
model = LinearRegression()

def getRegrS(arg_cd, arg_prevdt):
    # select one company's prices within one date group
    x = dfStock[(dfStock['compcd'] == arg_cd) & (dfStock['prevdt'] == arg_prevdt)]['rnk'].to_numpy().reshape((-1, 1))
    y = dfStock[(dfStock['compcd'] == arg_cd) & (dfStock['prevdt'] == arg_prevdt)]['adjenp'].to_numpy()
    model.fit(x, y)
    return model.coef_[0], model.score(x, y)

# Method tried 2: same selection, but scipy.stats.linregress instead of sklearn
def getRegrS(arg_cd, arg_prevdt):
    x = dfStock[(dfStock['compcd'] == arg_cd) & (dfStock['prevdt'] == arg_prevdt)]['rnk'].to_numpy()
    y = dfStock[(dfStock['compcd'] == arg_cd) & (dfStock['prevdt'] == arg_prevdt)]['adjenp'].to_numpy()
    rv = stats.linregress(x, y)
    return rv[0], rv[2]          # slope and r (note: rv[2] is r, not r-squared)

# rank 1, 2, 3, ... within each (company, date group); getRegrS reads dfStock['rnk']
dfStock['rnk'] = dfStock.groupby(['compcd', 'prevdt']).cumcount() + 1
dfSumS[['slope', 'rsq']] = [getRegrS(cd, prevdt)
                            for cd, prevdt in zip(dfSumS['compcd'], dfSumS['prevdt'])]
What I've tried before
Based on the recommendation in this link, I tried vectorization, but got the message "Can only compare identically-labeled Series objects". Unable to solve that problem, I fell back to the two functions above, which are not fast enough. Both work on a smaller data set, such as the year 2020 alone, but once the period grows to 2~3 decades they take hours.
I thought of apply, iterrows, etc., but didn't use them because, firstly, the link says they are slower than what I've done, and secondly, each of them seems to apply to only one column, while I need two results - the coefficient and the R-square over the same period - so calling them twice would definitely be slower.
Now I'm trying the multiprocessing pool approach mentioned in a few posts.
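Roughly what that would look like (a sketch only, reusing dfStock and the column names above; untested on the full data set):

from multiprocessing import Pool
from scipy import stats

def fit_group(item):
    # item is one ((compcd, prevdt), group) pair produced by groupby
    (cd, prevdt), grp = item
    rv = stats.linregress(grp['rnk'].to_numpy(), grp['adjenp'].to_numpy())
    return cd, prevdt, rv.slope, rv.rvalue ** 2

if __name__ == '__main__':
    # one task per (company, date group) pair, spread across worker processes
    groups = list(dfStock.groupby(['compcd', 'prevdt']))
    with Pool() as pool:
        results = pool.map(fit_group, groups)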

I'm afraid that if you're trying to run thousands of large linear regressions, you will have to pay the price in running time. If you are only interested in the beta coefficient or the R² value, it could be more efficient to calculate them directly with numpy: the coefficients as (XᵀX)⁻¹Xᵀy, and r as cov(X,y)/sqrt(var(X)·var(y)) (square it to get R²).
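For example, all of the slopes and R² values can be computed in one pass with groupby aggregations and the closed-form least-squares formulas, instead of fitting one model per group. A rough sketch, assuming the column names from the question ('compcd', 'prevdt', 'adjenp') and no single-row groups:

import pandas as pd

def group_regression(df, group_cols=('compcd', 'prevdt'), y_col='adjenp'):
    # slope, R-squared and mean price per group, from sums of squares only
    df = df.copy()
    df['x'] = df.groupby(list(group_cols)).cumcount() + 1   # time rank within group
    df['xy'] = df['x'] * df[y_col]
    df['xx'] = df['x'] ** 2
    df['yy'] = df[y_col] ** 2

    g = df.groupby(list(group_cols)).agg(
        n=('x', 'size'),
        sx=('x', 'sum'), sy=(y_col, 'sum'),
        sxy=('xy', 'sum'), sxx=('xx', 'sum'), syy=('yy', 'sum'),
        avg_price=(y_col, 'mean'),
    )

    sxy = g['sxy'] - g['sx'] * g['sy'] / g['n']   # n * cov(x, y)
    sxx = g['sxx'] - g['sx'] ** 2 / g['n']        # n * var(x)
    syy = g['syy'] - g['sy'] ** 2 / g['n']        # n * var(y)

    g['slope'] = sxy / sxx
    g['rsq'] = sxy ** 2 / (sxx * syy)
    return g[['avg_price', 'slope', 'rsq']].reset_index()

dfSumS = group_regression(dfStock)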

Related

How can I implement my random forest timeseries classification model as a web API?

I'm working on my master's thesis right now, for which I want to implement a machine learning model that analyses time series using a random forest regression. For feature extraction I'm using TSFRESH and the extract_relevant_features function. I later need this model to be available as an API, so I'm wondering how I can extract the exact same features I trained my model on from a future time series I want to predict on.
Thank you for any help :)
The time series consists of heart rate BPM values, from which I want to predict the valence and arousal values, each on a scale from 1-9. I'm using a scientific dataset called the Emognition wearable dataset.
My final dataframe looks like this:
| | bpm | participant |
|----------------------------|------|-------------|
| 2020-07-16T09:12:51:623529 | 87.0 | 22_1 |
| 2020-07-16T09:12:51:723320 | 87.0 | 22_1 |
| ... | ... | ... |
| 2020-07-16T09:12:52:022693 | 82.0 | 22_1 |
My y DataFrame looks like this:
|      | arousal | valence |
|------|---------|---------|
| 22_1 | 5       | 2       |
| 22_2 | 4       | 2       |
| ...  | ...     | ...     |
| 22_5 | 6       | 4       |
When I extract the relevant features from my time-series data with respect to the given arousal value (separately from valence), I get the following list of features:
Index(['bpm__number_peaks__n_3',
'bpm__augmented_dickey_fuller__attr_"teststat"__autolag_"AIC"',
'bpm__augmented_dickey_fuller__attr_"pvalue"__autolag_"AIC"',
'bpm__fft_aggregated__aggtype_"centroid"',
'bpm__absolute_sum_of_changes',
'bpm__fft_coefficient__attr_"abs"__coeff_91',
'bpm__ratio_value_number_to_time_series_length',
'bpm__fft_aggregated__aggtype_"variance"',
'bpm__fft_coefficient__attr_"abs"__coeff_60',
'bpm__fft_coefficient__attr_"real"__coeff_4',
...
'bpm__sum_of_reoccurring_data_points', 'bpm__sum_values',
'bpm__fft_coefficient__attr_"abs"__coeff_0',
'bpm__fft_coefficient__attr_"real"__coeff_0',
'bpm__fft_coefficient__attr_"abs"__coeff_9',
'bpm__fft_coefficient__attr_"abs"__coeff_79',
'bpm__fft_coefficient__attr_"abs"__coeff_12',
'bpm__fft_coefficient__attr_"abs"__coeff_52',
'bpm__fft_coefficient__attr_"abs"__coeff_35',
'bpm__lempel_ziv_complexity__bins_5'],
dtype='object', length=123)
For valence I get these features:
Index(['bpm__range_count__max_1000000000000.0__min_0', 'bpm__length',
'bpm__number_peaks__n_1', 'bpm__range_count__max_1__min_-1',
'bpm__fft_aggregated__aggtype_"variance"', 'bpm__abs_energy',
'bpm__sum_of_reoccurring_data_points', 'bpm__number_cwt_peaks__n_1',
'bpm__fft_coefficient__attr_"abs"__coeff_90',
'bpm__fft_coefficient__attr_"real"__coeff_0',
'bpm__fft_coefficient__attr_"abs"__coeff_0', 'bpm__sum_values'],
dtype='object')
So I guess that after training I need to build an API that can extract these exact features from any time series I pass to it, right? How can I go about this?
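One possible approach (a sketch only; it relies on tsfresh's from_columns helper, and the variable names are made up): turn the list of selected feature names back into extraction settings, store them alongside the trained model, and have the API apply exactly those settings to each incoming series:

import pandas as pd
from tsfresh import extract_features
from tsfresh.feature_extraction.settings import from_columns

# 'relevant_features' is assumed to be the Index of selected column names shown above
kind_to_fc_parameters = from_columns(list(relevant_features))

def features_for_new_series(new_bpm_df):
    # new_bpm_df: long-format frame with columns ['id', 'time', 'bpm'] for the new recording(s)
    return extract_features(
        new_bpm_df,
        column_id='id',
        column_sort='time',
        kind_to_fc_parameters=kind_to_fc_parameters,
    )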

Best way to update prices using pandas

First off, I'm writing this post on my phone while on the road, so sorry for the lack of info; I'm just trying to get a head start for when I get home.
I have 2 CSV files, each with a different number of columns and a different number of records. The first file has about 150k records and the second about 1.2 million. The first column of the first file holds values that may or may not appear in a column of the second file. What I intend to do is check whether the value in column one of the first file appears in the first column of the second file. If so, check whether the first file's second column is less than or greater than the value of a corresponding column in the second file where the first columns match. If so, update the first file's second column to that new value.
Side note: I don't need the fastest or most efficient way, I just need a working solution for now; I will optimize later. The code will be run once a month to update the CSV file.
Currently I am attempting to accomplish this using pandas, loading each file into a dataframe, but I am struggling to make it work. If this is the best way, could you help me do it? Once I figure this part out I can work out the rest; I'm just stuck.
What I thought of before posting this question: make a third dataframe containing the Material and DCost columns where the Item and Material columns match, then loop through that dataframe and, wherever Item and Material match, update the Cost column in the CSV file.
I also wondered whether uploading the CSV files to a database and running queries would be easier.
Would converting the dataframes to dicts work with this much data?
File 1
+--------+-------+--------+
| Item   | Cost  | Price  |
+--------+-------+--------+
| Labor  | 0     | 100.00 |
| 785342 | 12.54 | 24.76  |
| 620388 | 15.78 | 36.99  |
+--------+-------+--------+
File 2
+----------+--------+-----------+
| Material | DCost  | List Cost |
+----------+--------+-----------+
| 10C0024  | .24    | 1.56      |
| 785342   | 12.54  | 23.76     |
| 620388   | 16.99  | 36.99     |
| 2020101  | 100.76 | 267.78    |
+----------+--------+-----------+
Intended result to export to CSV:
+--------+-------+--------+
| Item   | Cost  | Price  |
+--------+-------+--------+
| Labor  | 0     | 100.00 |
| 785342 | 12.54 | 23.76  |
| 620388 | 16.99 | 36.99  |
+--------+-------+--------+
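A rough sketch of the merge-based version (the file names are made up, and it assumes "update" simply means taking the file-2 value whenever the part numbers match):

import pandas as pd

df1 = pd.read_csv('file1.csv')   # Item, Cost, Price   (~150k rows)
df2 = pd.read_csv('file2.csv')   # Material, DCost, List Cost   (~1.2m rows)

# left join: keep every file-1 row, matched against file 2 on the part number
merged = df1.merge(df2, left_on='Item', right_on='Material', how='left')

# where a match exists, take the file-2 cost and price; otherwise keep the old values
merged['Cost'] = merged['DCost'].fillna(merged['Cost'])
merged['Price'] = merged['List Cost'].fillna(merged['Price'])

merged[['Item', 'Cost', 'Price']].to_csv('file1_updated.csv', index=False)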

How to represent the trend (upward/downward/no change) of data?

I have a dataset where each row represents the number of occurrences of a certain behaviour. The columns represent windows of a set amount of time. It looks like this:
+----------+----------+----------+----------+----------+------+
| Episode1 | Episode2 | Episode3 | Episode4 | Episode5 | ...  |
+----------+----------+----------+----------+----------+------+
| 2        | 0        | 1        | 3        |          |      |
| 1        | 2        | 4        | 2        | 3        |      |
| 0        |          |          |          |          |      |
+----------+----------+----------+----------+----------+------+
There are over 150 episodes. I want to find a way to represent each row as a trend, i.e. whether the occurrences are becoming more or less frequent.
I have tried first calculating the average/median/sum of every 3/5/10 cells of each row (because the rows have different lengths and many 0 values), and then correlating these with a horizontal axis representing time; the coefficients of these correlations should indicate the trend (<0 means downward, >0 upward). The trends will be used in further analysis.
I'm wondering if there is a better way to do this. Thanks.
If you expect the trend to be linear, you could fit a linear regression to each row separately, using time to predict the number of occurrences of the behaviour, and then store the slopes.
This slope represents the effect of increasing time by one episode on the behaviour. It also naturally accounts for the differing lengths of the time series.
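A small sketch of that idea (it assumes the dataframe holds only the Episode columns, and treats empty cells as missing rather than as zero):

import numpy as np

def row_slope(row):
    # slope of an ordinary least-squares line fitted to one row's counts over time
    y = row.dropna().to_numpy(dtype=float)
    if len(y) < 2:
        return np.nan            # not enough episodes to define a trend
    x = np.arange(len(y))
    return np.polyfit(x, y, 1)[0]

df['trend_slope'] = df.apply(row_slope, axis=1)

A positive slope then means the behaviour is becoming more frequent, a negative one less frequent.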

Are sales values decreasing over time? Pandas

I have a dataframe full of stock transactions. I want to work out how to tell whether a certain account is ordering less than it usually does.
NB: some customers might place a big order (double the size of their normal orders) and then take twice as long to order again, or order half the size at double the usual frequency.
Data layout:
SA_DACCNT = Account number
SA_TRREF = Invoice No
+-----------+------------+----------+------------+------------+
| SA_DACCNT | SA_TRDATE  | SA_TRREF | SA_TRVALUE | SA_PRODUCT |
+-----------+------------+----------+------------+------------+
| GO63      | 01/05/2019 | 11587    | 0.98       | R613S/6    |
| GO63      | 01/05/2019 | 11587    | 0.98       | R614S/6    |
| AAA1      | 01/05/2019 | 11587    | 1.96       | R613S/18   |
| GO63      | 01/05/2019 | 11587    | 2.5        | POST3      |
| DOD2      | 01/07/2019 | 11588    | 7.84       | R613S/18   |
+-----------+------------+----------+------------+------------+
So Far:
I've tried grouping by customer, resampling into quarters, and analysing the z-score of the last quarter for each customer, but it doesn't always work out correctly because not everyone orders quarterly, so there are gaps which skew the figures.
I've also tried fitting a linear model to the figures for each customer and using the coefficient as a metric to determine whether spend is down, but it still doesn't look right.
I've been looking at is_monotonic_increasing and other such functions but am still unable to find what I'm looking for; I'm sure there is a statistical technique for it, but it eludes me.
I'm pretty new to all this and am left scratching my head over how best to tackle it.
My goal is to determine WHO has spent less (over their last few orders) than they usually do.
Basically, who do we need to call / chase / send offers to, etc.
Any ideas on the correct way to analyse this would be appreciated. I'm not looking to copy someone's code; I'm sure I can work that bit out myself.
EDIT:
I've also now tried using diff to calculate the average distance between orders, resampling to that interval, and then calculating the Z-score, but it's still not what I expect.
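One way to frame it (a sketch only; the column names are taken from the table above, while the last-3-orders window and the 0.8 cut-off are arbitrary choices): collapse the transactions to one total per invoice, then compare each account's most recent orders with its earlier average:

import pandas as pd

df['SA_TRDATE'] = pd.to_datetime(df['SA_TRDATE'], dayfirst=True)

# one row per invoice: total order value per account, in date order
orders = (df.groupby(['SA_DACCNT', 'SA_TRREF'])
            .agg(order_date=('SA_TRDATE', 'min'),
                 order_value=('SA_TRVALUE', 'sum'))
            .reset_index()
            .sort_values('order_date'))

def recent_vs_usual(g, last_n=3):
    # mean of the last few orders versus the mean of everything before them
    if len(g) <= last_n:
        mean = g['order_value'].mean()
        return pd.Series({'recent': mean, 'usual': mean, 'ratio': 1.0})
    recent = g['order_value'].tail(last_n).mean()
    usual = g['order_value'].head(len(g) - last_n).mean()
    return pd.Series({'recent': recent, 'usual': usual, 'ratio': recent / usual})

summary = orders.groupby('SA_DACCNT').apply(recent_vs_usual)

# accounts whose recent orders are well below their own historical average
to_chase = summary[summary['ratio'] < 0.8].sort_values('ratio')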

Handling data inside dataframe pandas

I have a problem with one of my school projects: I am attempting to change the layout of my data.
You can see how the data is arranged in this picture, which contains a sample of the data I am referring to.
This is the format I am attempting to reach:
Company name | activity description | year | variable 1 | variable 2 | ...
company 1    |                      | 2011 |            |            |
company 1    |                      | 2012 |            |            |
...            (one row for every year, from 2014 to 2015 inclusive)
company 2    |                      | 2011 |            |            |
company 2    |                      | 2012 |            |            |
...            (one row for every year, from 2014 to 2015 inclusive)
for every single one of the 10 companies. This is a sample of my whole dataset, which contains more than 15,000 companies. I attempted creating a dataframe of the size I want, but I have problems filling it with the data I want in the format I want. I am fairly new to Python. Could anyone help me, please?
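Since the sample is only shown as a picture, the following is only a guess at the reshaping step: it assumes the wide layout has one column per variable/year combination (variable1_2011, variable1_2012, ...), in which case pandas' wide_to_long (or melt) produces one row per company per year:

import pandas as pd

# hypothetical wide-format sample, assumed from the description above
wide = pd.DataFrame({
    'company_name': ['company 1', 'company 2'],
    'activity_description': ['desc 1', 'desc 2'],
    'variable1_2011': [10, 20], 'variable1_2012': [11, 21],
    'variable2_2011': [5, 6],   'variable2_2012': [7, 8],
})

long = (pd.wide_to_long(wide,
                        stubnames=['variable1', 'variable2'],
                        i=['company_name', 'activity_description'],
                        j='year', sep='_')
          .reset_index()
          .sort_values(['company_name', 'year']))
print(long)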
