How to calculate and apply exponential function to a time series - python

I currently have the time series below and a 4th-degree polynomial as my best-fit line. How do I calculate the correct coefficients and apply an exponential function to this series? My goal is to get some type of growth rate.
from io import StringIO

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

data2 = StringIO("""
date value
09-Oct-17 0.304
10-Nov-17 0.316
26-Nov-17 0.636
12-Dec-17 0.652
28-Dec-17 0.639
13-Jan-18 0.623
02-Mar-18 0.619
18-Mar-18 0.608
19-Apr-18 0.605
05-May-18 0.625
06-Jun-18 0.639
22-Jun-18 0.663
08-Jul-18 0.64
24-Jul-18 0.623
09-Aug-18 0.632
28-Oct-18 0.736
""")
df2 = pd.read_table(data2, delim_whitespace=True)
df2.loc[:, "date"] = pd.to_datetime(df2.loc[:, "date"], format="%d-%b-%y")
y_values2 = df2.loc[:, "value"]
x_values2 = np.linspace(0,1,len(df2.loc[:, "value"]))
poly_degree = 4
coeffs2 = np.polyfit(x_values2, y_values2, poly_degree)
poly_eqn2 = np.poly1d(coeffs2)
y_hat2 = poly_eqn2(x_values2)
plt.figure(figsize=(12,8))
# The next two lines plot an earlier series (df, y_hat) that is not defined in this snippet:
# plt.plot(df.loc[:, "date"], df.loc[:,"value"], "ro", color='green')
# plt.plot(df.loc[:, "date"], y_hat)
plt.plot(df2.loc[:, "date"], df2.loc[:,"value"] ,"ro",color='red')
plt.plot(df2.loc[:, "date"],y_hat2)
plt.title('WSC-10-50')
plt.ylabel('NDVI')
plt.xlabel('Date')
plt.savefig("NDVI_plot.png")
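One way to get an actual growth rate (a sketch, not part of the original code): fit an exponential y = a * exp(b * x) with scipy.optimize.curve_fit and read the growth rate off the fitted exponent b. The starting guess p0 is an assumption to help the solver converge.
from scipy.optimize import curve_fit

def exp_model(x, a, b):
    return a * np.exp(b * x)

# x_values2 is already scaled to [0, 1], so b is the growth over the whole period
params, _ = curve_fit(exp_model, x_values2, y_values2, p0=(y_values2.iloc[0], 0.1))
a_fit, b_fit = params
y_exp2 = exp_model(x_values2, a_fit, b_fit)
print(f"fitted growth rate b: {b_fit:.4f} (per unit of scaled x)")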

Related

How to subtract buy/sell rows for each group in dataframe

I have a dataframe that looks like this:
symbol   side   min    max     mean   wav
1000038  buy    0.931  1.0162  0.977  0.992
1000038  sell   0.932  1.0173  0.978  0.995
1000039  buy    0.881  1.00    0.99   0.995
1000039  sell   0.885  1.025   0.995  1.001
What is the most pythonic (efficient) way of generating a new dataframe consisting of the differences between the buys and the sells of each symbol?
For example, for symbol 1000038, the difference between the min sell and the min buy is (0.932 - 0.931) = 0.001.
I am seeking a method that avoids looping through the dataframe rows, as I believe this would be inefficient. Instead, I am looking for a grouping type of solution.
I have tried something like this:
df1 = stats[['symbol','side']].join(stats[['mean','wav']].diff(-1))
df2 = df1[df1['side']=='sell']
print(df2)
but it does not seem to work as expected.
You could use the pandas MultiIndex. First, set up the data:
import pandas as pd
columns = ('symbol', 'side', 'min', 'max', 'mean', 'wav')
data = [
(1000038, 'buy', 0.931, 1.0162, 0.977, 0.992),
(1000038, 'sell', 0.932, 1.0173, 0.978, 0.995),
(1000039, 'buy', 0.881, 1.00, 0.99, 0.995),
(1000039, 'sell', 0.885, 1.025, 0.995, 1.001),
]
df = pd.DataFrame(data = data, columns = columns)
Then, create the index and compute the difference between two data frames:
df2 = df.set_index(['side', 'symbol'], verify_integrity=True)
df2 = df2.sort_index()
df2.loc[('buy',), :] - df2.loc[('sell',), :]
The result is:
min max mean wav
symbol
1000038 -0.001 -0.0011 -0.001 -0.003
1000039 -0.004 -0.0250 -0.005 -0.006
I'm assuming that each symbol (like 1000038) appears twice. You could use fillna() if you have unmatched buys and sells, as sketched below.
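For instance, a minimal sketch with the df2 built above: if a symbol appears on only one side, subtracting the two cross-sections leaves an all-NaN row for it, which you can then fill or drop.
buys = df2.loc[('buy',), :]
sells = df2.loc[('sell',), :]
diff = buys - sells      # symbols present on only one side become all-NaN rows
diff = diff.fillna(0)    # or diff.dropna(), depending on how you want to treat them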
If needed, start with drop_duplicates and sort_values to make sure each symbol only has 1xbuy and 1xsell (in that order):
df = df.drop_duplicates(['symbol', 'side']).sort_values(['symbol', 'side'])
Then use either xs (faster) or groupby.diff for the group subtractions.
xs
Set the index to ['side', 'symbol'] and use xs to get cross-sections for buy and sell:
df.set_index(['side', 'symbol']).pipe(lambda df: df.xs('sell') - df.xs('buy'))
# min max mean wav
# symbol
# 1000038 0.001 0.0011 0.001 0.003
# 1000039 0.004 0.0250 0.005 0.006
groupby.diff
Set the index to symbol and subtract the groups using groupby.diff:
df.drop(columns='side').set_index('symbol').groupby('symbol').diff().dropna()
# min max mean wav
# symbol
# 1000038 0.001 0.0011 0.001 0.003
# 1000039 0.004 0.0250 0.005 0.006
- To flip the subtraction order (buy minus sell), use diff(-1); see the sketch after this list.
- If your version throws an error with groupby('symbol'), use groupby(level=0).
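Picking up the first bullet, a minimal sketch of the flipped order (buy minus sell) on the same frame, with the expected output shown as comments:
df.drop(columns='side').set_index('symbol').groupby('symbol').diff(-1).dropna()
#              min     max   mean    wav
# symbol
# 1000038   -0.001 -0.0011 -0.001 -0.003
# 1000039   -0.004 -0.0250 -0.005 -0.006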

Convert day numbers into dates in python

How do you convert day numbers (1,2,3...728,729,730) to dates in python? I can assign an arbitrary year to start the date count as the year doesn't matter to me.
I am working on learning time series analysis, ARIMA, SARIMA, etc using python. I have a CSV dataset with two columns, 'Day' and 'Revenue'. The Day column contains numbers 1-731, Revenue contains numbers 0-18.154... I have had success building the model, running statistical tests, building visualizations, etc. But when it comes to forecasting using prophet I am hitting a wall.
Here are what I feel are the relevant parts of the code related to the question:
from pandas import read_csv
import matplotlib.pyplot as plt
from statsmodels.tsa.arima.model import ARIMA
from prophet import Prophet  # or: from fbprophet import Prophet on older installs

# Loading the CSV with pandas. This code converts the "Day" column into the index.
df = read_csv("telco_time_series.csv", index_col=0, parse_dates=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 731 entries, 1 to 731
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Revenue 731 non-null float64
dtypes: float64(1)
memory usage: 11.4 KB
df.head()
Revenue
Day
1 0.000000
2 0.000793
3 0.825542
4 0.320332
5 1.082554
# Instantiate the model
model = ARIMA(df, order=(4,1,0))
# Fit the model
results = model.fit()
# Print summary
print(results.summary())
# line plot of residuals
residuals = (results.resid)
residuals.plot()
plt.show()
# density plot of residuals
residuals.plot(kind='kde')
plt.show()
# summary stats of residuals
print(residuals.describe())
SARIMAX Results
==============================================================================
Dep. Variable: Revenue No. Observations: 731
Model: ARIMA(4, 1, 0) Log Likelihood -489.105
Date: Tue, 03 Aug 2021 AIC 988.210
Time: 07:29:55 BIC 1011.175
Sample: 0 HQIC 997.070
- 731
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 -0.4642 0.037 -12.460 0.000 -0.537 -0.391
ar.L2 0.0295 0.040 0.746 0.456 -0.048 0.107
ar.L3 0.0618 0.041 1.509 0.131 -0.018 0.142
ar.L4 0.0366 0.039 0.946 0.344 -0.039 0.112
sigma2 0.2235 0.013 17.629 0.000 0.199 0.248
===================================================================================
Ljung-Box (L1) (Q): 0.01 Jarque-Bera (JB): 2.52
Prob(Q): 0.90 Prob(JB): 0.28
Heteroskedasticity (H): 1.01 Skew: -0.05
Prob(H) (two-sided): 0.91 Kurtosis: 2.73
===================================================================================
df.columns=['ds','y']
ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements
m = Prophet()
m.fit(df)
ValueError: Dataframe must have columns "ds" and "y" with the dates and values
respectively.
I've had success with the forecast using prophet if I fill the values in the CSV with dates, but I would like to convert the Day numbers within the code using pandas.
Any ideas?
"I can assign an arbitrary year to start the date count as the year doesn't matter to me (...) Any ideas?"
You might harness datetime.timedelta for this task. Select any date you wish as day 0 and then add datetime.timedelta(days=x) where x is your day number, for example:
import datetime
day0 = datetime.date(2000,1,1)
day120 = day0 + datetime.timedelta(days=120)
print(day120)
output
2000-04-30
Wrap this in a function and use .apply if you have a pandas.DataFrame, like so:
import datetime
import pandas as pd
def convert_to_date(x):
    return datetime.date(2000, 1, 1) + datetime.timedelta(days=x)
df = pd.DataFrame({'day_n':[1,2,3,4,5]})
df['day_date'] = df['day_n'].apply(convert_to_date)
print(df)
output
day_n day_date
0 1 2000-01-02
1 2 2000-01-03
2 3 2000-01-04
3 4 2000-01-05
4 5 2000-01-06
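To connect this back to Prophet, here is a minimal sketch (not part of the original answer): assuming the frame from the question, with the day numbers in a 'Day' index and a single 'Revenue' column, and an arbitrary 2000-01-01 start date, you can build the ds/y columns Prophet expects directly with pd.to_datetime:
prophet_df = df.reset_index()  # 'Day' becomes a regular column
prophet_df['ds'] = pd.to_datetime(prophet_df['Day'], unit='D', origin='2000-01-01')
prophet_df['y'] = prophet_df['Revenue']
prophet_df = prophet_df[['ds', 'y']]
# m = Prophet(); m.fit(prophet_df)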

How to display `.value_counts()` in interval in pandas dataframe

I need to display .value_counts() by interval in a pandas dataframe. Here's my code:
prob['bucket'] = pd.qcut(prob['prob good'], 20)
grouped = prob.groupby('bucket', as_index = False)
kstable = pd.DataFrame()
kstable['min_prob'] = grouped.min()['prob good']
kstable['max_prob'] = grouped.max()['prob good']
kstable['counts'] = prob['bucket'].value_counts()
My Output
min_prob max_prob counts
0 0.26 0.48 NaN
1 0.49 0.52 NaN
2 0.53 0.54 NaN
3 0.55 0.56 NaN
4 0.57 0.58 NaN
I know that I have a problem with the kstable['counts'] syntax, but how do I solve this?
Use named aggregation to simplify your code; for counts, GroupBy.size is applied to the bucket column to create the new counts column:
prob['bucket'] = pd.qcut(prob['prob good'], 20)
kstable = prob.groupby('bucket', as_index = False).agg(min_prob=('prob good','min'),
max_prob=('prob good','max'),
counts=('bucket','size'))
In your solution, prob['bucket'].value_counts() is indexed by the interval categories, so it does not align with kstable's RangeIndex (hence the NaN). Sort it by bucket and take the underlying values, e.g. with DataFrame.assign:
kstable = kstable.assign(counts = prob['bucket'].value_counts().sort_index().values)

Pandas mean() for multiindex

I have df:
CU Parameters 1 2 3
379-H Output Energy, (Wh/h) 0.045 0.055 0.042
349-J Output Energy, (Wh/h) 0.001 0.003 0
625-H Output Energy, (Wh/h) 2.695 1.224 1.272
626-F Output Energy, (Wh/h) 1.381 1.494 1.3
I would like to create two separate dfs, getting the mean of column values by grouping index on level 0 (CU):
df1: (379-H and 625-H)
Parameters 1 2 3
Output Energy, (Wh/h) 1.37 0.63 0.657
df2: (the rest)
Parameters 1 2 3
Output Energy, (Wh/h) 0.69 0.74 0.65
I can get the mean for all of them by grouping on level 1:
df = df.apply(pd.to_numeric, errors='coerce').dropna(how='all').groupby(level=1).mean()
but how do I group these according to level 0?
SOLUTION:
lightsonly = ["379-H", "625-H"]
df = df.apply(pd.to_numeric, errors='coerce').dropna(how='all')
mask = df.index.get_level_values(0).isin(lightsonly)
df1 = df[mask].groupby(level=1).mean()
df2 = df[~mask].groupby(level=1).mean()
Use get_level_values + isin to build a True/False index, then get the mean and rename with a dict:
d = {True: '379-H and 625-H', False: 'the rest'}
df.index = df.index.get_level_values(0).isin(['379-H', '625-H'])
df = df.mean(level=0).rename(d)
print (df)
1 2 3
the rest 0.691 0.7485 0.650
379-H and 625-H 1.370 0.6395 0.657
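Side note: on recent pandas versions DataFrame.mean no longer accepts the level argument; a sketch of the equivalent call there is to group on the index level first:
df = df.groupby(level=0).mean().rename(d)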
For separate dfs, it is also possible to use boolean indexing:
mask= df.index.get_level_values(0).isin(['379-H', '625-H'])
df1 = df[mask].mean().rename('379-H and 625-H').to_frame().T
print (df1)
1 2 3
379-H and 625-H 1.37 0.6395 0.657
df2 = df[~mask].mean().rename('the rest').to_frame().T
print (df2)
1 2 3
the rest 0.691 0.7485 0.65
Another numpy solution with DataFrame constructor:
a1 = df[mask].values.mean(axis=0)
#alternatively
#a1 = df.values[mask].mean(axis=0)
df1 = pd.DataFrame(a1.reshape(-1, len(a1)), index=['379-H and 625-H'], columns=df.columns)
print (df1)
1 2 3
379-H and 625-H 1.37 0.6395 0.657
Consider the dataframe df where CU and Parameters are assumed to be in the index.
1 2 3
CU Parameters
379-H Output Energy, (Wh/h) 0.045 0.055 0.042
349-J Output Energy, (Wh/h) 0.001 0.003 0.000
625-H Output Energy, (Wh/h) 2.695 1.224 1.272
626-F Output Energy, (Wh/h) 1.381 1.494 1.300
Then we can group by the truth values of whether the first-level values are in the list ['379-H', '625-H'].
m = {True: 'Main', False: 'Rest'}
l = ['379-H', '625-H']
g = df.index.get_level_values('CU').isin(l)
df.groupby(g).mean().rename(index=m)
1 2 3
Rest 0.691 0.7485 0.650
Main 1.370 0.6395 0.657
# Use a lambda function to map the index into 2 groups, then groupby using the modified index.
df.groupby(by=lambda x:'379-H,625-H' if x[0] in ['379-H','625-H'] else 'Others').mean()
Out[22]:
1 2 3
379-H,625-H 1.370 0.6395 0.657
Others 0.691 0.7485 0.650

pandas long to wide multicolumn reshaping

I have a pandas data frame as follows:
request_id crash_id counter num_acc_x num_acc_y num_acc_z
745109.0 670140638.0 0 0.010 0.000 -0.045
745109.0 670140638.0 1 0.016 -0.006 -0.034
745109.0 670140638.0 2 0.016 -0.006 -0.034
My id vars are "request_id" and "crash_id"; the target vars are num_acc_x, num_acc_y, and num_acc_z.
I would like to create a new DataFrame where the target vars are reshaped wide, that is, adding max(counter)*3 new vars like num_acc_x_0, num_acc_x_1, ..., num_acc_y_0, num_acc_y_1, ..., num_acc_z_0, num_acc_z_1, ..., preferably without having a pivot as the final result (I would like a true DataFrame, as in R).
Thanks in advance for your attention.
I think you need set_index with unstack; last, create the column names from the MultiIndex with map:
df = df.set_index(['request_id','crash_id','counter']).unstack()
df.columns = df.columns.map(lambda x: '{}_{}'.format(x[0], x[1]))
df = df.reset_index()
print (df)
request_id crash_id num_acc_x_0 num_acc_x_1 num_acc_x_2 \
0 745109.0 670140638.0 0.01 0.016 0.016
num_acc_y_0 num_acc_y_1 num_acc_y_2 num_acc_z_0 num_acc_z_1 \
0 0.0 -0.006 -0.006 -0.045 -0.034
num_acc_z_2
0 -0.034
Another solution, aggregating duplicates with pivot_table:
df = df.pivot_table(index=['request_id','crash_id'], columns='counter', aggfunc='mean')
df.columns = df.columns.map(lambda x: '{}_{}'.format(x[0], x[1]))
df = df.reset_index()
print (df)
request_id crash_id num_acc_x_0 num_acc_x_1 num_acc_x_2 \
0 745109.0 670140638.0 0.01 0.016 0.016
num_acc_y_0 num_acc_y_1 num_acc_y_2 num_acc_z_0 num_acc_z_1 \
0 0.0 -0.006 -0.006 -0.045 -0.034
num_acc_z_2
0 -0.034
A third option is to aggregate duplicates with groupby + mean and then unstack:
df = df.groupby(['request_id','crash_id','counter']).mean().unstack()
df.columns = df.columns.map(lambda x: '{}_{}'.format(x[0], x[1]))
df = df.reset_index()
print (df)
request_id crash_id num_acc_x_0 num_acc_x_1 num_acc_x_2 \
0 745109.0 670140638.0 0.01 0.016 0.016
num_acc_y_0 num_acc_y_1 num_acc_y_2 num_acc_z_0 num_acc_z_1 \
0 0.0 -0.006 -0.006 -0.045 -0.034
num_acc_z_2
0 -0.034
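As a stylistic alternative (a sketch equivalent to the map/format call above), the column flattening can also be written with an f-string comprehension on Python 3.6+:
df = df.set_index(['request_id','crash_id','counter']).unstack()
df.columns = [f'{var}_{cnt}' for var, cnt in df.columns]
df = df.reset_index()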
