pd.DataFrame.ewm().cov(): just get the last covariance matrix - python

mat is a pd.DataFrame of daily returns for several stocks.
pd.DataFrame.ewm().cov() computes the covariance of these stocks for every day. When the shape of mat is 250 × 2972 this takes too much time and returns 250 covariance matrices, but I only want the last one. How can I do this more simply and save some time?
mat.head()
SecuCode 000001 000002 000004 000005
TradingDay
2016-08-31 0.00211 0.09547 0.0 0.00802
2016-09-01 -0.00422 -0.06163 0.0 -0.01746
2016-09-02 0.00000 0.01398 0.0 -0.00680
2016-09-05 -0.00318 -0.01398 0.0 0.00408
2016-09-06 -0.00106 -0.00513 0.0 0.01081
5 rows × 2972 columns
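Pandas has no built-in way to ask ewm().cov() for only the final matrix, but the last exponentially weighted covariance can be computed directly with one weighted matrix product. A minimal sketch (assuming no missing values in mat and the bias=True variant of the estimator; the span value is a placeholder for whatever you would pass to ewm()):

import numpy as np
import pandas as pd

def last_ewm_cov(mat: pd.DataFrame, span: float = 60) -> pd.DataFrame:
    """Exponentially weighted covariance matrix for the last date only."""
    alpha = 2.0 / (span + 1.0)
    n = len(mat)
    # decayed weights: oldest observation first, newest last, normalized to 1
    w = (1.0 - alpha) ** np.arange(n - 1, -1, -1)
    w /= w.sum()
    x = mat.to_numpy()
    xc = x - w @ x                      # subtract the weighted column means
    cov = (xc * w[:, None]).T @ xc      # one 2972 x 2972 weighted product
    return pd.DataFrame(cov, index=mat.columns, columns=mat.columns)

last_cov = last_ewm_cov(mat, span=60)

Up to pandas' default bias correction (ewm().cov() uses bias=False), this should reproduce mat.ewm(span=span).cov().loc[mat.index[-1]] with a single matrix multiplication instead of building 250 matrices.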


Algorithm for predict(start=start_date, end=end_date) on unique same-named weather stations

Story that pertains to a new design solution
The goal is to use weather data to fit an ARIMA model on each group of like-named 'stations' with their associated precipitation data, then execute a 30-day forward forecast. I am looking to process one set of same-named stations, then move on to the next unique set of same-named stations, and so on.
The algorithm question
How do I write an algorithm that runs an ARIMA model for each unique 'station' - perhaps grouping same-named stations and running the ARIMA model on each group - and then fits a 30-day forward forecast? ARIMA(2,1,1) is a working set of order terms taken from auto.arima().
How do I group same-named 'stations' before running the ARIMA model, fit, and forecast? Or what other approach would process one set of like-named stations and then move on to the next unique set?
Working code executes but needs a broader algorithm
The code was working, but on the last run predict(start=start_date, end=end_date) raised a KeyError. I have since removed NA values, which may fix predict(start, end):
import pandas as pd
import statsmodels.tsa.arima_model  # note: removed in statsmodels >= 0.13

wd.weather_data = wd.weather_data[wd.weather_data['date'].notna()]
forecast_models = [50000]
n = 1
df_all_stations = data_prcp.drop(['level_0', 'index', 'prcp'], axis=1)
wd.weather_data.sort_values("date", axis=0, ascending=True, inplace=True)
# NOTE: this iterates over every row's station value, not the unique stations,
# and fits on the full dataset each time -- this is the part that needs the
# grouping algorithm.
for station_name in wd.weather_data['station']:
    start_date = pd.to_datetime(wd.weather_data['date'])
    number_of_days = 31
    end_date = pd.to_datetime(start_date) + pd.DateOffset(days=30)
    model = statsmodels.tsa.arima_model.ARIMA(wd.weather_data['prcp'], order=(2, 1, 1))
    model_fit = model.fit()
    forecast = model_fit.predict(start=start_date, end=end_date)
    forecast_models.append(forecast)
Data Source
        station            date        tavg        tmin        tmax    prcp  snow   wdir   wspd       pres
0       Anchorage, AK      2018-01-01  -4.166667   -8.033333   -0.30   0.3   80.0   143.0  5.766667   995.133333
35328   Grand Forks, ND    2018-01-01  -14.900000  -23.300000  -6.70   0.0   0.0    172.0  33.800000  1019.200000
86016   Key West, FL       2018-01-01  20.700000   16.100000   25.60   0.0   0.0    4.0    13.000000  1019.900000
59904   Wilmington, NC     2018-01-01  -2.500000   -7.100000   0.00    0.0   0.0    200.0  21.600000  1017.000000
66048   State College, PA  2018-01-01  -13.500000  -17.000000  -10.00  4.5   0.0    243.0  12.700000  1015.200000
...     ...                ...         ...         ...         ...     ...   ...    ...    ...        ...
151850  Kansas City, MO    2022-03-30  9.550000    3.700000    16.55   21.1  0.0    294.5  24.400000  998.000000
151889  Springfield, MO    2022-03-30  12.400000   4.500000    17.10   48.9  0.0    227.0  19.700000  997.000000
151890  St. Louis, MO      2022-03-30  14.800000   8.000000    17.60   24.9  0.0    204.0  20.300000  996.400000
151891  State College, PA  2022-03-30  0.400000    -5.200000   6.20    0.2   0.0    129.0  10.800000  1020.400000
151899  Wilmington, NC     2022-03-30  14.400000   6.200000    20.20   0.0   0.0    154.0  16.400000  1021.900000
Error
KeyError: 'The `start` argument could not be matched to a location related to the index of the data.'
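No answer was posted here, so the following is only a sketch of the grouping the question asks for. Note that in the code above start_date is set to pd.to_datetime of the entire 'date' column, i.e. a whole Series, which is one reason predict(start=..., end=...) cannot be matched to the index. The sketch assumes one daily 'prcp' value per station and uses the modern statsmodels.tsa.arima.model.ARIMA (the old statsmodels.tsa.arima_model was removed in statsmodels 0.13); the (2,1,1) order comes from the question:

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

forecasts = {}
for station_name, station_df in wd.weather_data.groupby('station'):
    # Give each station's series a proper DatetimeIndex so that predict()
    # can match start/end dates to the index (this avoids the KeyError).
    series = (station_df.dropna(subset=['date'])
                        .assign(date=lambda d: pd.to_datetime(d['date']))
                        .set_index('date')
                        .sort_index()['prcp']
                        .asfreq('D'))  # assumption: one observation per day

    model_fit = ARIMA(series, order=(2, 1, 1)).fit()
    start_date = series.index[-1] + pd.Timedelta(days=1)
    end_date = start_date + pd.DateOffset(days=29)  # 30-day forward forecast
    forecasts[station_name] = model_fit.predict(start=start_date, end=end_date)

groupby('station') handles the "unique same-named stations" requirement directly: each iteration sees one station's rows, so each ARIMA fit and forecast is per-station rather than over the whole dataset.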

Why doesn't the data retrieved with get_pvgis_hourly match the PVGIS PV Performance Tool?

I am interested in retrieving the yearly in-plane irradiation via code, given a set of parameters, using the get_pvgis_hourly function (pvlib's PVGIS parser/getter), as follows:
get_pvgis_hourly(45.858217, 12.267183, angle=7, aspect=-44, outputformat='csv',
usehorizon=True, userhorizon=None, raddatabase="PVGIS-SARAH",
startyear=None, endyear=None, pvcalculation=False,
peakpower=1, pvtechchoice='crystSi',
mountingplace='free', loss=14, trackingtype=0,
optimal_inclination=False, optimalangles=False,
components=False, url=URL, map_variables=True, timeout=30)[0].sum()
and the output is the following:
poa_global 17993672.40
solar_elevation 1489417.07
temp_air 1417261.89
wind_speed 213468.18
Int 2386.00
dtype: float64
But if I use the same data in the PVGIS PV Performance Tool, I obtain different results.
Any hint would be appreciated.
I suggest getting into the habit of examining the underlying data prior to summarizing it. Oftentimes you will find that the assumptions you had about the data don't hold. Printing and plotting the data are good places to start.
data, inputs, meta = get_pvgis_hourly(45.858217, 12.267183, angle=7, aspect=-44, outputformat='csv',
usehorizon=True, userhorizon=None, raddatabase="PVGIS-SARAH",
startyear=None, endyear=None, pvcalculation=False,
peakpower=1, pvtechchoice='crystSi',
mountingplace='free', loss=14, trackingtype=0,
optimal_inclination=False, optimalangles=False,
components=False, url=URL, map_variables=True, timeout=30)
print(data['poa_global'])
time
2005-01-01 00:10:00+00:00 0.0
2005-01-01 01:10:00+00:00 0.0
2005-01-01 02:10:00+00:00 0.0
2005-01-01 03:10:00+00:00 0.0
2005-01-01 04:10:00+00:00 0.0
...
2016-12-31 19:10:00+00:00 0.0
2016-12-31 20:10:00+00:00 0.0
2016-12-31 21:10:00+00:00 0.0
2016-12-31 22:10:00+00:00 0.0
2016-12-31 23:10:00+00:00 0.0
Name: poa_global, Length: 105192, dtype: float64
data['poa_global'].plot()
This shows that data holds hourly values spanning 12 years, so when you calculate the sum of the entire thing, it is total insolation over 12 years. The PVGIS website divides by 12 to get average annual insolation. Finally, notice that there is a units difference (kWh/m^2 vs Wh/m^2), so dividing by 1000 gets things lining up:
In [25]: data['poa_global'].sum() / 12 / 1000
Out[25]: 1499.4727
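As a quick sanity check (a sketch reusing the data frame from above), you can group by calendar year to see each year's insolation individually:

# annual in-plane insolation, converted from Wh/m^2 to kWh/m^2
annual_kwh = data['poa_global'].groupby(data.index.year).sum() / 1000
print(annual_kwh)
print(annual_kwh.mean())  # should land near the 1499.47 average above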

How to use Prophet's make_future_dataframe with multiple regressors?

make_future_dataframe seems to produce a dataframe with only date (ds) values, which in turn results in ValueError: Regressor 'var' missing from dataframe when attempting to generate forecasts with the code below.
m = Prophet()
m.add_country_holidays(country_name='US')
m.add_regressor('var')
m.fit(df)
forecasts = m.predict(m.make_future_dataframe(periods=7))
Looking through the Python docs, there doesn't seem to be any mention of how to handle this in Prophet. Is my only option to write additional code that lags all regressors by the period for which I want to generate forecasts (e.g. take var at t-7 to produce a 7-day daily forecast)?
The issue here is that m.make_future_dataframe creates a dataset future whose only column is the ds date column. In order to predict with a model that uses regressors, the future dataset also needs a column for each regressor.
Using my original training data, which I called regression_data, I solved this by predicting the values of the regressor variables and then filling them into a future_w_regressors dataset, a merge of future and regression_data.
Assume you have a trained model model ready.
# List of regressors
regressors = ['Total Minutes','Sent Emails','Banner Active']
# My data is weekly so I project out 1 year (52 weeks), this is what I want to forecast
future = model.make_future_dataframe(52, freq='W')
At this point, if you run model.predict(future) you will get the error you've been getting. What we need to do is incorporate the regressors. I merge regression_data with future so that the observations from the past are filled in; as you can see, the forward-looking observations are empty (towards the end of the table):
# regression_data is the dataframe I used to train the model (include all covariates)
# merge the data you used to train the model
future_w_regressors = regression_data[regressors+['ds']].merge(future, how='outer', on='ds')
future_w_regressors
Total Minutes Sent Emails Banner Active ds
0 7.129552 9.241493e-03 0.0 2018-01-07
1 7.157242 8.629305e-14 0.0 2018-01-14
2 7.155367 8.629305e-14 0.0 2018-01-21
3 7.164352 8.629305e-14 0.0 2018-01-28
4 7.165526 8.629305e-14 0.0 2018-02-04
... ... ... ... ...
283 NaN NaN NaN 2023-06-11
284 NaN NaN NaN 2023-06-18
285 NaN NaN NaN 2023-06-25
286 NaN NaN NaN 2023-07-02
287 NaN NaN NaN 2023-07-09
Solution 1: Predict Regressors
For the next step I create a dataset containing only the rows with empty regressor values, loop through each regressor, train a naive Prophet model on each, predict its values for the future dates, fill those values into the empty dataset, and place them back into the future_w_regressors dataset.
# Get the segment for which we have no regressor values
empty_future = future_w_regressors[future_w_regressors[regressors[0]].isnull()]
only_future = empty_future[['ds']]
# Forecast each regressor with its own naive Prophet model
for regressor in regressors:
    # Prep a new training dataset
    train = regression_data[['ds', regressor]]
    train.columns = ['ds', 'y']  # rename the columns so Prophet accepts them
    # Train a model for this regressor
    rmodel = Prophet()
    rmodel.weekly_seasonality = False  # this is specific to my case
    rmodel.fit(train)
    regressor_predictions = rmodel.predict(only_future)
    # Replace the empty values with the predictions from the regressor model
    empty_future[regressor] = regressor_predictions['yhat'].values
# Fill in the values for all regressors in the future_w_regressors dataset
future_w_regressors.loc[future_w_regressors[regressors[0]].isnull(), regressors] = empty_future[regressors].values
Now the future_w_regressors table no longer has missing values
future_w_regressors
Total Minutes Sent Emails Banner Active ds
0 7.129552 9.241493e-03 0.000000 2018-01-07
1 7.157242 8.629305e-14 0.000000 2018-01-14
2 7.155367 8.629305e-14 0.000000 2018-01-21
3 7.164352 8.629305e-14 0.000000 2018-01-28
4 7.165526 8.629305e-14 0.000000 2018-02-04
... ... ... ... ...
283 7.161023 -1.114906e-02 0.548577 2023-06-11
284 7.156832 -1.138025e-02 0.404318 2023-06-18
285 7.150829 -5.642398e-03 0.465311 2023-06-25
286 7.146200 -2.989316e-04 0.699624 2023-07-02
287 7.145258 1.568782e-03 0.962070 2023-07-09
And I can run the predict command to get my forecasts which now extend into 2023 (original data ended in 2022):
model.predict(future_w_regressors)
ds trend yhat_lower yhat_upper trend_lower trend_upper Banner Active Banner Active_lower Banner Active_upper Sent Emails Sent Emails_lower Sent Emails_upper Total Minutes Total Minutes_lower Total Minutes_upper additive_terms additive_terms_lower additive_terms_upper extra_regressors_additive extra_regressors_additive_lower extra_regressors_additive_upper yearly yearly_lower yearly_upper multiplicative_terms multiplicative_terms_lower multiplicative_terms_upper yhat
0 2018-01-07 2.118724 2.159304 2.373065 2.118724 2.118724 0.000000 0.000000 0.000000 3.681765e-04 3.681765e-04 3.681765e-04 0.076736 0.076736 0.076736 0.152302 0.152302 0.152302 0.077104 0.077104 0.077104 0.075198 0.075198 0.075198 0.0 0.0 0.0 2.271026
1 2018-01-14 2.119545 2.109899 2.327498 2.119545 2.119545 0.000000 0.000000 0.000000 3.437872e-15 3.437872e-15 3.437872e-15 0.077034 0.077034 0.077034 0.098945 0.098945 0.098945 0.077034 0.077034 0.077034 0.021911 0.021911 0.021911 0.0 0.0 0.0 2.218490
2 2018-01-21 2.120366 2.074524 2.293829 2.120366 2.120366 0.000000 0.000000 0.000000 3.437872e-15 3.437872e-15 3.437872e-15 0.077014 0.077014 0.077014 0.064139 0.064139 0.064139 0.077014 0.077014 0.077014 -0.012874 -0.012874 -0.012874 0.0 0.0 0.0 2.184506
3 2018-01-28 2.121187 2.069461 2.279815 2.121187 2.121187 0.000000 0.000000 0.000000 3.437872e-15 3.437872e-15 3.437872e-15 0.077110 0.077110 0.077110 0.050180 0.050180 0.050180 0.077110 0.077110 0.077110 -0.026931 -0.026931 -0.026931 0.0 0.0 0.0 2.171367
4 2018-02-04 2.122009 2.063122 2.271638 2.122009 2.122009 0.000000 0.000000 0.000000 3.437872e-15 3.437872e-15 3.437872e-15 0.077123 0.077123 0.077123 0.046624 0.046624 0.046624 0.077123 0.077123 0.077123 -0.030498 -0.030498 -0.030498 0.0 0.0 0.0 2.168633
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
283 2023-06-11 2.062645 2.022276 2.238241 2.045284 2.078576 0.025237 0.025237 0.025237 -4.441732e-04 -4.441732e-04 -4.441732e-04 0.077074 0.077074 0.077074 0.070976 0.070976 0.070976 0.101867 0.101867 0.101867 -0.030891 -0.030891 -0.030891 0.0 0.0 0.0 2.133621
284 2023-06-18 2.061211 1.975744 2.199376 2.043279 2.077973 0.018600 0.018600 0.018600 -4.533835e-04 -4.533835e-04 -4.533835e-04 0.077029 0.077029 0.077029 0.025293 0.025293 0.025293 0.095176 0.095176 0.095176 -0.069883 -0.069883 -0.069883 0.0 0.0 0.0 2.086504
285 2023-06-25 2.059778 1.951075 2.162531 2.041192 2.077091 0.021406 0.021406 0.021406 -2.247903e-04 -2.247903e-04 -2.247903e-04 0.076965 0.076965 0.076965 0.002630 0.002630 0.002630 0.098146 0.098146 0.098146 -0.095516 -0.095516 -0.095516 0.0 0.0 0.0 2.062408
286 2023-07-02 2.058344 1.953027 2.177666 2.039228 2.076373 0.032185 0.032185 0.032185 -1.190929e-05 -1.190929e-05 -1.190929e-05 0.076915 0.076915 0.076915 0.006746 0.006746 0.006746 0.109088 0.109088 0.109088 -0.102342 -0.102342 -0.102342 0.0 0.0 0.0 2.065090
287 2023-07-09 2.056911 1.987989 2.206830 2.037272 2.075110 0.044259 0.044259 0.044259 6.249949e-05 6.249949e-05 6.249949e-05 0.076905 0.076905 0.076905 0.039813 0.039813 0.039813 0.121226 0.121226 0.121226 -0.081414 -0.081414 -0.081414 0.0 0.0 0.0 2.096724
288 rows × 28 columns
Note that I trained the model for each regressor naively. However, you could optimize prediction for those independent variables if you wanted to.
Solution 2: Use last year's regressor values
Alternatively, you may not want to compound the uncertainty of the regressor forecasts onto your main forecast and just want an idea of how the forecast changes for different values of the regressors. In that case you can simply copy the regressor values from the last year into the missing rows of the future_w_regressors dataset. This has the added benefit of easily simulating drops or increases relative to current regressor levels:
from datetime import timedelta

last_date = regression_data.iloc[-1]['ds']
one_year_ago = last_date - timedelta(days=365)  # works with data at any scale
last_year_of_regressors = regression_data.loc[regression_data['ds'] > one_year_ago, regressors]
# If you want to simulate a 10% drop in levels compared to this year
last_year_of_regressors = last_year_of_regressors * 0.9

# Fill the missing future rows with last year's (scaled) regressor values
missing = future_w_regressors[regressors[0]].isnull()
future_w_regressors.loc[missing, regressors] = last_year_of_regressors.iloc[:missing.sum()].values

Select value from dataframe based on other dataframe

I am trying to calculate the position of an object based on a timestamp. For this I have two pandas dataframes: one for the measurement data and one for the position. All the movement is straightforward acceleration.
Dataframe 1 contains the measurement data:
ms force ... ... ...
1 5 20
2 10 20
3 15 25
4 20 30
5 25 20
..... (~ 6000 lines)
Dataframe 2 contains "positioning data"
ms speed (m/s)
1 0 0.66
2 4500 0.66
3 8000 1.3
4 16000 3.0
5 20000 3.0
.....(~300 lines)
Now I want to calculate the position for the first dataframe using the data from the second dataframe.
In Excel I solved the problem with an array formula, but now I have to use Python/Pandas and I can't find a way to select the correct row from dataframe 2.
My idea is an interval check like the one shown in the update below.
In the end I want to display a graph of force vs. position rather than force vs. time.
Thank you in advance.
==========================================================================
Update:
In the meantime I have almost solved my issue. Now my data looks like this:
Dataframe 2 (Speed Data):
pos v a t t-end t-start
0 -3.000 0.666667 0.000000 4.500000 4.500000 0.000000
1 0.000 0.666667 0.187037 0.071287 4.571287 4.500000
2 0.048 0.680000 0.650794 0.010244 4.581531 4.571287
3 0.055 0.686667 0.205432 0.064904 4.646435 4.581531
...
15 0.055 0.686667 0.5 0.064904 23.0 20.0
...
28 0.055 0.686667 0.6 0.064904 35.0 34.0
...
30 0.055 0.686667 0.9 0.064904 44.0 39.0
And Dataframe 1 (time-based measurement):
Fx Fy Fz abs_t expected output ('a' from DF2)
0 -13.9 170.3 45.0 0.005 0.000000
1 -14.1 151.6 38.2 0.010 0.000000
...
200 -14.1 131.4 30.4 20.015 0.5
...
300 -14.3 111.9 21.1 34.01 0.6
...
400 -14.5 95.6 13.2 40.025
So I want to take the time (abs_t) from DF1 and look up the correct 'a' in DF2.
Something like this (pseudocode):
if DF1['abs_t'] between (DF2['t-start'], DF2['t-end']):
    DF1['a'] = DF2['a']
I could write two for loops, but that looks like the wrong way and would be very, very slow.
I hope you understand my problem; providing a running sample is very hard.
In Excel I solved it with an array formula.
I found a very slow solution, but at least it works:
df1['a'] = 0
for index, row in df2.iterrows():
    start = row['t-start']
    end = row['t-end']
    a = row['a']
    df1.loc[(df1['abs_t'] > start) & (df1['abs_t'] < end), 'a'] = a
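A vectorized alternative that avoids the row-by-row loop (a sketch; the column names come from the update above, and it assumes the [t-start, t-end) intervals in df2 do not overlap): build a pd.IntervalIndex over df2's time windows and look up every abs_t at once.

import numpy as np
import pandas as pd

# one interval per row of df2; closed='left' means t-start <= t < t-end
intervals = pd.IntervalIndex.from_arrays(df2['t-start'], df2['t-end'], closed='left')
idx = intervals.get_indexer(df1['abs_t'])   # -1 where abs_t falls in no interval
a_values = df2['a'].to_numpy()
df1['a'] = np.where(idx >= 0, a_values[np.clip(idx, 0, None)], 0)

get_indexer performs a single lookup per timestamp, so this scales to the ~6000 rows of df1 without the quadratic cost of nested loops.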

How to have a date-based x axis and add vertical lines based on a list of dates with ggplot?

I have created this graph:
def myGraphNew(cc, ccBad, tY, tTitle):
    p1 = ggplot(cc, aes('DT', tY, color='Type')) + \
        geom_line(size=2.) + \
        xlab("Date") + \
        ggtitle(tTitle) + \
        scale_x_date(labels='%d-%b', breaks=date_breaks('week'))
    return p1
I supply this data to produce the base graph:
0 2016-09-22 Covered by Test 0.0
1 2016-09-25 Covered by Test 1.0
2 2016-09-26 Covered by Test -3.0
3 2016-09-27 Covered by Test 1.0
4 2016-09-28 Covered by Test 5.0
5 2016-09-30 Covered by Test 0.0
6 2016-10-02 Covered by Test 113.0
7 2016-10-03 Covered by Test 67.0
8 2016-10-04 Covered by Test 105.0
0 2016-09-22 Total 0.0
1 2016-09-25 Total 47.0
2 2016-09-26 Total -72.0
3 2016-09-27 Total -13.0
4 2016-09-28 Total 16.0
5 2016-09-30 Total 0.0
6 2016-10-02 Total 58.0
7 2016-10-03 Total 77.0
8 2016-10-04 Total 334.0
I also have a list that contains a datetime I want to draw a vertical line for. Currently this list only has one value but over time I expect that it will hold multiple values. These dates are generated as part of the process that cleans the input data.
I have added the line
geom_vline(xintercept=ccBad)
but I get the error
in axvline scalex = (xx < xmin) or (xx > xmax)
TypeError: unorderable types: float() > NoneType()
and the rendered graph is missing information, rather than the graph that is produced without the geom_vline.
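The traceback suggests the old yhat ggplot's geom_vline cannot handle datetime xintercept values at all. Below is a sketch of the same graph in plotnine, the maintained Python ggplot port (parameter names differ slightly, and this is untested against the data above): geom_vline accepts a list for xintercept and draws one line per entry.

import pandas as pd
from plotnine import ggplot, aes, geom_line, geom_vline, labs, scale_x_date

def myGraphNew(cc, ccBad, tY, tTitle):
    # ccBad: list of dates to mark; coerce to real datetimes first.
    # If your plotnine version rejects datetimes here, convert them
    # with matplotlib.dates.date2num instead.
    bad_dates = list(pd.to_datetime(pd.Series(ccBad)))
    return (ggplot(cc, aes('DT', tY, color='Type'))
            + geom_line(size=2.0)
            + geom_vline(xintercept=bad_dates, linetype='dashed')
            + scale_x_date(date_breaks='1 week', date_labels='%d-%b')
            + labs(x='Date', title=tTitle))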
