Calculate normalized spectral entropy of time series - python

I have a time series data
c1= c(0.558642328,
0.567173803,
0.572518969,
0.579917556,
0.592155421,
0.600239837,
0.598955071,
0.608857572,
0.615442061,
0.613502347,
0.618076897,
0.626769781,
0.633930194,
0.645518577,
0.66773088,
0.68128165,
0.695552504,
0.6992836,
0.702771866,
0.700840271,
0.684032428,
0.665082645,
0.646948862,
0.621813893,
0.597888613,
0.577744126,
0.555984044,
0.533597678,
0.523645413,
0.522041142,
0.525437844,
0.53053292,
0.543152606,
0.549038792,
0.555300856,
0.563411331,
0.572663951,
0.584438777,
0.589476192,
0.604197562,
0.61670388,
0.624161184,
0.624345171,
0.629342985,
0.630379665,
0.620067096,
0.597480375,
0.576228619,
0.561285031,
0.543921304,
0.530826211,
0.519563568,
0.514228535,
0.515202665,
0.516663855,
0.525673366,
0.543545395,
0.551681638,
0.558951402,
0.566816133,
0.573842585,
0.578611696,
0.589180577,
0.603297615,
0.624550509,
0.641310155,
0.655093217,
0.668385196,
0.671600127,
0.658876967,
0.641041982,
0.605081463,
0.585503519,
0.556173635,
0.527428073,
0.502755737,
0.482510734,
0.453295642,
0.439938772,
0.428757811,
0.422361642,
0.40945864,
0.399504355,
0.412688798,
0.42684828,
0.456935656,
0.48355422,
0.513727218,
0.541630101,
0.559122121,
0.561763656,
0.572532833,
0.576761365,
0.576146233,
0.580199403,
0.584954906)
corresponding to dates
dates = seq(as.Date("2016-09-01"), as.Date("2020-07-30"), by=15)
What I want to do is compute the normalized spectral entropy of this time series. I have read in the literature that a high value indicates high stability of a system.
I have found a function here: https://rdrr.io/cran/ForeCA/man/spectral_entropy.html, but I cannot get it to produce what I want.
I am new to this topic, so any interpretation of the result would be helpful too.
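In Python, one common definition of normalized spectral entropy is the Shannon entropy of the normalized power spectral density, divided by the log of the number of frequency bins so the result lies in [0, 1]. Below is a minimal sketch of that definition using scipy; it is not the ForeCA implementation, and the example series at the bottom is stand-in data rather than the c1 vector above.

import numpy as np
from scipy.signal import periodogram

def normalized_spectral_entropy(x, fs=1.0):
    """Shannon entropy of the normalized PSD, scaled to [0, 1]."""
    x = np.asarray(x, dtype=float)
    f, pxx = periodogram(x - x.mean(), fs=fs)  # demean so the DC bin does not dominate
    pxx = pxx[pxx > 0]                         # drop zero-power bins to avoid log(0)
    p = pxx / pxx.sum()                        # normalize the PSD into a probability distribution
    h = -np.sum(p * np.log(p))                 # Shannon entropy of the spectrum
    return h / np.log(len(p))                  # divide by the maximum possible entropy

# Stand-in example data (replace with your own series, e.g. the c1 values):
rng = np.random.default_rng(0)
t = np.arange(200)
periodic = np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(len(t))
noise = rng.standard_normal(len(t))

print(normalized_spectral_entropy(periodic))  # low: energy concentrated at one frequency
print(normalized_spectral_entropy(noise))     # close to 1: flat, noise-like spectrum

Under this definition, values near 0 correspond to a spectrum concentrated in a few frequencies (a strongly periodic, predictable series), while values near 1 correspond to a flat, noise-like spectrum. Some authors (e.g. ForeCA's forecastability measure Omega) report 1 minus this quantity, so check which convention your reference uses before reading "high" as "stable".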

Related

statistics program gives out values different from test sample

I wrote a program for a statistics exercise. Simply put, I was supposed to predict prices for the next 250 days, then extract the lowest and highest price from 10k runs of 250-day predictions.
I followed the instructions in the problem to use the gauss method from the random module with the mean and std of the given sample.
The highest and lowest prices in the test sample are in the range of 45-55, but I predict 18-88. Is there a problem with my code, or is it just not a good method for prediction?
from random import gauss

with open('AAPL_train.csv', 'r') as sheet:  # we categorize the data here
    Date = []
    Open = []
    High = []
    Low = []
    Close = []
    Adj_Close = []
    Volume = []
    for lines in sheet.readlines()[1:-1]:
        words = lines.strip().split(',')
        Date.append(words[0])
        Open.append(float(words[1]))
        High.append(float(words[2]))
        Low.append(float(words[3]))
        Close.append(float(words[4]))
        Adj_Close.append(float(words[5]))
        Volume.append(int(words[6]))

subtract = []  # find the pattern of price changing by finding the day-to-day changes
for i in range(1, len(Volume)):
    subtract.append(Adj_Close[i] - Adj_Close[i-1])

mean = sum(subtract) / len(subtract)  # find the mean and std of the change pattern
accum = 0
for amount in subtract:
    accum += (amount - mean) ** 2
var = accum / len(subtract)
stdev = var ** 0.5

worst = []
best = []

def Getwb():  # a function to predict one 250-day path
    index = Adj_Close[-1]
    index_lst = []
    for i in range(250):
        index += gauss(mean, stdev)
        index_lst.append(index)
    worst = min(index_lst)
    best = max(index_lst)
    return worst, best

for i in range(10000):  # try predicting 10000 times and then extract highest and lowest result
    x, y = Getwb()
    worst.append(x)
    best.append(y)

print(min(worst))
print(max(best))
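For what it's worth, the same experiment can be written in vectorized form with numpy, which makes it easier to look at the whole distribution of per-path minima and maxima instead of only the single most extreme values. This is a sketch that assumes the mean, stdev and Adj_Close computed above are available.

import numpy as np

n_paths, n_days = 10_000, 250
rng = np.random.default_rng()
steps = rng.normal(mean, stdev, size=(n_paths, n_days))  # simulated daily changes
paths = Adj_Close[-1] + np.cumsum(steps, axis=1)          # simulated price paths

path_min = paths.min(axis=1)  # lowest price reached within each path
path_max = paths.max(axis=1)  # highest price reached within each path

print(path_min.min(), path_max.max())                           # same statistic the loop above prints
print(np.percentile(path_min, 5), np.percentile(path_max, 95))  # a more robust summary of the spread

Note that min(worst) and max(best) report the most extreme outcome across 10,000 independent paths, which is expected to be much wider than the 45-55 range of a single realized test path, so a wide 18-88 interval does not by itself mean the code is wrong.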

Algorithm for efficient portfolio optimization

I'm trying to find the best allocation for a portfolio based on backtesting data. As a general rule, I've divided stocks into large caps and small/mid caps, and into growth and value, and I want no more than 80% of my portfolio in large caps or 70% of my portfolio in value. I need an algorithm that will be flexible enough to use for more than two stocks. So far, what I have is (using a simple class called Ticker):
randomBoolean = True
listOfTickers = []
listOfLargeCaps = []
listOfSmallMidCaps = []
largeCapAllocation = 0
listOfValue = []
listOfGrowthBlend = []
valueAllocation = 0

while randomBoolean:
    tickerName = input("What is the name of the ticker?")
    tickerCap = input("What is the cap of the ticker?")
    tickerAllocation = int(input("Around how much do you want to allocate in this ticker?"))
    tickerValue = input("Is this ticker a Value, Growth, or Blend stock?")
    tickerName = Ticker(tickerCap, tickerValue, tickerAllocation, tickerName)
    listOfTickers.append(tickerName)
    closer = input("Type DONE if you are finished. Type ENTER to continue entering tickers")
    if closer == "DONE":
        randomBoolean = False

for ticker in listOfTickers:
    if ticker.cap == ("Large" or "large"):
        listOfLargeCaps.append(ticker)
    else:
        listOfSmallMidCaps.append(ticker)
    if ticker.value == ("Value" or "value"):
        listOfValue.append(ticker)
    else:
        listOfGrowthBlend.append(ticker)

for largeCap in listOfLargeCaps:
    largeCapAllocation += largeCap.allocation
if largeCapAllocation > 80:
    pass  # run a function that will readjust ticker stuff and decrease allocation to large cap stocks

for value in listOfValue:
    valueAllocation += value.allocation
if valueAllocation > 70:
    pass  # run a function that will readjust ticker stuff and decrease allocation to value stocks
The "function" I have so far just iterates through -5 to 6 in a sort of
for i in range (-5,6):
ticker1AllocationPercent + i
ticker2AllocationPercent - i
#update the bestBalance if the new allocation is better
How would I modify this algorithm to work for 3, 4, 5, etc. stocks, and how would I go about changing the allocations for the large/small-mid cap stocks and such?
As mentioned in the above answer, a quadratic solver is typically used for problems like this. You can use the quadratic solver available in Pyportfolio. See this link for more details.
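If you would rather stay in plain Python than adopt a portfolio library, the cap and style limits can also be written as linear constraints for a general-purpose solver. Below is a minimal sketch using scipy.optimize.minimize; the expected returns mu, the covariance matrix cov, the risk-aversion value, and the large-cap/value flags are made-up placeholders, not data from the question.

import numpy as np
from scipy.optimize import minimize

mu = np.array([0.08, 0.10, 0.12, 0.07])   # hypothetical expected returns per ticker
cov = np.diag([0.04, 0.09, 0.16, 0.03])   # hypothetical covariance matrix
is_large = np.array([1, 1, 0, 1])         # 1 if the ticker is a large cap
is_value = np.array([1, 0, 0, 1])         # 1 if the ticker is a value stock
n = len(mu)

def neg_utility(w, risk_aversion=3.0):
    # maximize expected return minus a risk penalty == minimize its negative
    return -(w @ mu - risk_aversion * w @ cov @ w)

constraints = [
    {"type": "eq",   "fun": lambda w: w.sum() - 1.0},        # weights sum to 100%
    {"type": "ineq", "fun": lambda w: 0.80 - w @ is_large},  # at most 80% in large caps
    {"type": "ineq", "fun": lambda w: 0.70 - w @ is_value},  # at most 70% in value
]
bounds = [(0.0, 1.0)] * n                                    # long-only weights

res = minimize(neg_utility, x0=np.full(n, 1.0 / n),
               method="SLSQP", bounds=bounds, constraints=constraints)
print(res.x.round(3))  # optimal weights under the constraints

Because the objective and constraints are written over a weight vector of arbitrary length, the same code handles 3, 4, 5 or more tickers without modification; only the input arrays change.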

Filter out/omit regions with rapid slope change

I am trying to find the slopes of data sets, but I have many different types of sets. Ideally, I would have a relatively straight line with some noise where I can take the average slope and std; such a plot can be seen below.
However, some datasets have sharp jumps (because the output data has to be resized) that interfere with the averaging and std processing.
Is there a technique or method to filter out and omit these sharp spikes? I'm thinking of identifying the spikes (by their max/min peaks) and then deleting the data within a region around each spike to clean up the data.
There is not much code to show: I am using T for time, Y for the output data, and dYdT for the slope, and graphing them.
import numpy as np
import matplotlib.pyplot as plt

plt.semilogy(T, Y)
plt.title('CUT QiES')
plt.xlabel('time')
plt.show()

plt.semilogy(T, dYdT)
plt.title('Slope')
plt.xlabel('time')
plt.show()

print('average slope:', np.average(dYdT))
print('std slope:', np.std(dYdT))
Here is an example dataset for T, Y, and dYdT
T = [2.34354 2.34632 2.3491 2.35188 2.35466 2.35744 2.36022 2.363 2.36578
2.36856 2.37134 2.37412 2.3769 2.37968 2.38246 2.38524 2.38802 2.3908
2.39358 2.39636 2.39914 2.40192 2.4047 2.40748 2.41026 2.41304 2.41582
2.4186 2.42138 2.42416 2.42694 2.42972 2.4325 2.43528 2.43806 2.44084
2.44362 2.4464 2.44918 2.45196 2.45474 2.45752 2.4603 2.46308 2.46586
2.46864 2.47142 2.4742 2.47698 2.47976 2.48254 2.48532 2.4881 2.49088
2.49366 2.49644 2.49922 2.502 2.50478 2.50756 2.51034 2.51312 2.5159
2.51868 2.52146 2.52424 2.52702 2.5298 2.53258 2.53536 2.53814 2.54092
2.5437 2.54648 2.54926 2.55204 2.55482 2.5576 2.56038 2.56316 2.56594
2.56872 2.5715 2.57428 2.57706 2.57984 2.58262 2.5854 2.58818 2.59096
2.59374 2.59652 2.5993 2.60208 2.60486 2.60764 2.61042 2.6132 2.61598
2.61876 2.62154 2.62432 2.6271 2.62988 2.63266 2.63544 2.63822 2.641
2.64378 2.64656 2.64934 2.65212 2.6549 2.65768 2.66046 2.66324 2.66602
2.6688 2.67158 2.67436 2.67714 2.67992 2.6827 2.68548 2.68826 2.69104
2.69382 2.6966 2.69938 2.70216 2.70494 2.70772 2.7105 2.71328 2.71606
2.71884 2.72162 2.7244 2.72718 2.72996 2.73274 2.73552 2.7383 2.74108
2.74386 2.74664 2.74942 2.7522 2.75498 2.75776 2.76054 2.76332 2.7661
2.76888 2.77166 2.77444 2.77722 2.78 2.78278 2.78556 2.78834 2.79112
2.7939 2.79668 2.79946 2.80224 2.80502 2.8078 2.81058 2.81336 2.81614
2.81892 2.8217 2.82448 2.82726 2.83004 2.83282 2.8356 2.83838 2.84116
2.84394 2.84672 2.8495 2.85228 2.85506 2.85784 2.86062 2.8634 2.86618
2.86896 2.87174 2.87452 2.8773 2.88008 2.88286 2.88564 2.88842 2.8912
2.89398 2.89676 2.89954 2.90232 2.9051 2.90788 2.91066 2.91344 2.91622
2.919 2.92178]
Y = [1.4490e+24 4.1187e+24 1.1708e+25 3.3279e+25 9.4596e+25 2.6889e+26
7.6435e+26 2.1727e+27 6.1762e+27 1.7556e+28 1.6093e-01 4.5747e-01
1.3004e+00 3.6967e+00 1.0508e+01 2.9872e+01 8.4918e+01 2.4140e+02
6.8623e+02 1.9508e+03 5.5455e+03 1.5764e+04 4.4814e+04 1.2739e+05
3.6214e+05 1.0295e+06 2.9265e+06 8.3192e+06 2.3649e+07 6.7227e+07
1.9111e+08 5.4325e+08 1.5443e+09 4.3899e+09 1.2479e+10 3.5473e+10
1.0084e+11 2.8663e+11 8.1479e+11 2.3161e+12 6.5837e+12 1.8714e+13
5.3197e+13 1.5121e+14 4.2983e+14 1.2218e+15 3.4730e+15 9.8721e+15
2.8062e+16 7.9766e+16 2.2674e+17 6.4450e+17 1.8320e+18 5.2075e+18
1.4803e+19 4.2077e+19 1.1961e+20 3.3998e+20 9.6642e+20 2.7471e+21
7.8089e+21 2.2197e+22 6.3098e+22 1.7936e+23 5.0986e+23 1.4493e+24
4.1199e+24 1.1711e+25 3.3291e+25 9.4636e+25 2.6902e+26 7.6473e+26
2.1739e+27 6.1796e+27 1.7567e+28 1.6088e-01 4.5734e-01 1.3001e+00
3.6957e+00 1.0506e+01 2.9864e+01 8.4893e+01 2.4132e+02 6.8600e+02
1.9501e+03 5.5434e+03 1.5758e+04 4.4794e+04 1.2733e+05 3.6196e+05
1.0289e+06 2.9248e+06 8.3141e+06 2.3634e+07 6.7181e+07 1.9097e+08
5.4284e+08 1.5431e+09 4.3863e+09 1.2468e+10 3.5442e+10 1.0075e+11
2.8638e+11 8.1404e+11 2.3140e+12 6.5776e+12 1.8697e+13 5.3148e+13
1.5108e+14 4.2944e+14 1.2207e+15 3.4700e+15 9.8637e+15 2.8038e+16
7.9701e+16 2.2656e+17 6.4401e+17 1.8307e+18 5.2039e+18 1.4793e+19
4.2049e+19 1.1953e+20 3.3978e+20 9.6587e+20 2.7456e+21 7.8048e+21
2.2186e+22 6.3068e+22 1.7928e+23 5.0963e+23 1.4487e+24 4.1181e+24
1.1706e+25 3.3277e+25 9.4595e+25 2.6890e+26 7.6439e+26 2.1729e+27
6.1767e+27 1.7558e+28 1.6085e-01 4.5724e-01 1.2998e+00 3.6947e+00
1.0503e+01 2.9855e+01 8.4867e+01 2.4124e+02 6.8576e+02 1.9493e+03
5.5412e+03 1.5751e+04 4.4775e+04 1.2728e+05 3.6179e+05 1.0284e+06
2.9234e+06 8.3100e+06 2.3622e+07 6.7147e+07 1.9087e+08 5.4256e+08
1.5423e+09 4.3840e+09 1.2462e+10 3.5424e+10 1.0070e+11 2.8624e+11
8.1366e+11 2.3129e+12 6.5747e+12 1.8689e+13 5.3126e+13 1.5102e+14
4.2928e+14 1.2203e+15 3.4688e+15 9.8604e+15 2.8029e+16 7.9677e+16
2.2649e+17 6.4383e+17 1.8302e+18 5.2025e+18 1.4789e+19 4.2039e+19
1.1950e+20 3.3970e+20 9.6565e+20 2.7450e+21 7.8030e+21 2.2181e+22
6.3052e+22 1.7923e+23 5.0950e+23 1.4483e+24 4.1170e+24 1.1703e+25
3.3267e+25 9.4566e+25 2.6882e+26 7.6414e+26 2.1721e+27 6.1745e+27
1.7552e+28 1.6085e-01 4.5723e-01 1.2997e+00 3.6946e+00]
dYdT = [ 375.78737971 375.77838714 375.80388102 375.77489158
375.78727498 375.78675618 375.79979331 375.79140766
375.80307981 375.78869648 -24051.07160929 375.80641042
375.79707901 375.81605048 375.79005109 375.82183992
375.81456769 375.81626751 375.81206469 375.82085444
375.80836087 375.80650046 375.82436377 375.80311386
375.81929146 375.82649247 375.80356916 375.81256353
375.81105553 375.81083518 375.81806689 375.79871965
375.81165219 375.80421386 375.80603732 375.80022022
375.8141218 375.77592917 375.80511723 375.7948217
375.79574124 375.78237753 375.80219314 375.77971056
375.79862682 375.78801386 375.78906206 375.78914497
375.79271626 375.78453256 375.79375471 375.78087618
375.78731038 375.78835536 375.80214704 375.7810828
375.80402049 375.77350442 375.79558632 375.79229009
375.7979493 375.78886178 375.80285205 375.79348417
375.80619385 375.79128849 375.80870862 375.79125174
375.81241695 375.80966301 375.80855141 375.80471369
375.81123673 375.80242985 375.81604229 -24051.40870014
375.81595371 375.81631899 375.80172556 375.8188996
375.7939637 375.80499941 375.80295453 375.81071013
375.81233977 375.80121497 375.80580644 375.80072988
375.79422292 375.80991615 375.79562622 375.80425607
375.8009951 375.80341171 375.79284757 375.80067561
375.79074445 375.80361218 375.78872914 375.78392619
375.80294807 375.80742564 375.7832372 375.78773547
375.79978522 375.78860014 375.78890145 375.79762286
375.80180688 375.78148795 375.79054293 375.80220477
375.79379778 375.79114431 375.79906468 375.80132293
375.79296526 375.80555123 375.7949414 375.80782426
375.78471547 375.80279891 375.80250484 375.8024821
375.80059702 375.80550314 375.79947132 375.81008995
375.80407198 375.80436789 375.80464364 375.80046381
375.79483415 375.81472546 375.80509096 375.80393624
375.80523987 375.80569415 375.79908899 375.80055332
-24051.29144699 375.80437535 375.81196702 375.78739346
375.81351451 375.78827315 375.80323568 375.79387173
375.80410925 375.79061168 375.80602519 375.78876673
375.80794699 375.80555255 375.78221186 375.78976345
375.80687988 375.79578647 375.79815551 375.79344053
375.79436049 375.79356487 375.80266523 375.78659745
375.79944765 375.79336058 375.81159827 375.78590649
375.79567215 375.7967047 375.80100764 375.79358484
375.80263853 375.80785172 375.79032673 375.80669873
375.79567722 375.79784997 375.79602616 375.80621348
375.79850067 375.80356916 375.80784654 375.79641323
375.80733162 375.79643801 375.79806206 375.80809489
375.80524264 375.80392238 375.80115114 375.80136383
375.79989791 375.79500521 375.8129336 375.79707955
375.80370071 375.79873252 375.79881131 375.80290963
375.80719684 375.79460713 375.79090002 375.80340505
375.80575394 -24051.16850348 375.79650823 375.79215864
375.80533293]
I found a way to identify the outlier points: use the median to find the typical data value, then delete any points that fall outside a range around that median. The code is as follows:
import statistics
import numpy as np

med = statistics.median(dYdT)  # find the median of the dataset
perc = .05                     # cutoff threshold
med_min = med - perc*med       # min cutoff
med_max = med + perc*med       # max cutoff

# find the locations of the sharp changes
peak_loc = []
peak_loc.extend(np.where(dYdT < med_min)[0])
peak_loc.extend(np.where(dYdT > med_max)[0])

# delete the data where the peaks are located
dYdT = np.delete(dYdT, peak_loc, axis=0)
Y = np.delete(Y, peak_loc, axis=0)
T = np.delete(T, peak_loc, axis=0)
Here are plots showing how it deleted points with rapidly varying slopes:
Example 1
Example 2
And proof that datasets that don't need cleaning are unaffected:
You can adjust the perc variable to make the cutoff more or less lax as you need.
You can use a median filter to remove outliers from a signal. More specifically, you can use medfilt from scipy.signal. Here is the result of scipy.signal.medfilt(dYdT, 3):
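A minimal sketch of that call applied to the arrays from the question (kernel size 3, as in the answer), with a before/after comparison of the summary statistics. medfilt replaces each sample by the median of a 3-sample window, so isolated spikes like the -24051... values are replaced by their well-behaved neighbours.

import numpy as np
from scipy.signal import medfilt

dYdT = np.asarray(dYdT)
dYdT_filtered = medfilt(dYdT, kernel_size=3)

print(dYdT.min(), dYdT.max())                    # before: the -24051... spikes dominate
print(dYdT_filtered.min(), dYdT_filtered.max())  # after: values stay close to ~375.8
print('average slope:', np.average(dYdT_filtered))
print('std slope:', np.std(dYdT_filtered))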

Why are pdf values NaN with the hypsecant distribution?

I looked into the question Best fit Distribution plots and found that the best-fit distribution for my dataset is the hypsecant distribution. When I ran the code with one part of my dataset, I got the following loc and scale parameters:
loc=0.040593036736931196 ; scale=-0.008338984851829193
However, when I pass, for instance, the data below into hypsecant.pdf with this loc and scale, I get NaNs as a result instead of values. Why is that?
Here is my input:
[ 1. 0.99745057 0.93944708 0.83424019 0.78561204 0.63211263
0.62151259 0.57883015 0.57512492 0.43878694 0.43813347 0.37548233
0.33836774 0.29610761 0.32102182 0.31378472 0.24809515 0.24638145
0.22580595 0.18480387 0.19404362 0.18919147 0.16377272 0.16954728
0.10912106 0.12407758 0.12819846 0.11673824 0.08957689 0.10353764
0.09469576 0.08336001 0.08591166 0.06309568 0.07445366 0.07062173
0.05535625 0.05682546 0.06803674 0.05217558 0.0492794 0.05403819
0.04535857 0.04562529 0.04259798 0.03830373 0.0374102 0.03217575
0.03291147 0.0288506 0.0268235 0.02467415 0.02409625 0.02486308
-0.02563436 -0.02801487 -0.02937738 -0.02948851 -0.03272476 -0.03324265
-0.03435844 -0.0383104 -0.03864602 -0.04091095 -0.04269355 -0.04428056
-0.05009069 -0.05037519 -0.05122204 -0.05770342 -0.06348465 -0.06468936
-0.06849683 -0.07477151 -0.08893675 -0.097383 -0.1033376 -0.10796748
-0.11835636 -0.13741154 -0.14920072 -0.16698451 -0.1715277 -0.20449029
-0.22241856 -0.25270058 -0.25699927 -0.26731036 -0.31098857 -0.35426224
-0.36204168 -0.44059844 -0.46754863 -0.53560093 -0.61463112 -0.65583547
-0.66378605 -0.70644849 -0.75217157 -0.92236344]
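A minimal sketch for reproducing the call with scipy.stats.hypsecant, using the loc/scale values reported above. One thing worth noting is that scipy's continuous distributions treat a non-positive scale as an invalid parameter and return nan from pdf, which is consistent with the negative fitted scale shown in the question.

import numpy as np
from scipy.stats import hypsecant

# loc/scale values reported in the question; note that scale is negative
loc, scale = 0.040593036736931196, -0.008338984851829193

x = np.array([1.0, 0.5, 0.0, -0.5, -0.92236344])  # a few points from the input above

print(hypsecant.pdf(x, loc=loc, scale=scale))       # nan for every x: scale <= 0 is invalid
print(hypsecant.pdf(x, loc=loc, scale=abs(scale)))  # finite values once the scale is strictly positive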

How to plot a document topic distribution in structural topic modeling R-package?

If I am using Python's sklearn for LDA topic modeling, I can use the transform function to get a "document topic distribution" of the LDA results like this:
document_topic_distribution = lda_model.transform(document_term_matrix)
Now I have also tried the R structural topic models (stm) package, and I want to get the same thing. Is there any function in the stm package that can produce the same output (a document topic distribution)?
I have created the stm object as follows:
stm_model <- stm(documents = out$documents, vocab = out$vocab,
K = number_of_topics, data = out$meta,
max.em.its = 75, init.type = "Spectral" )
But I couldn't figure out how to get the desired distribution out of this object, and the documentation didn't really help me either.
As emilliman5 pointed out, your stm_model provides access to the underlying parameters of the model, as is shown in the documentation.
Indeed, the theta parameter is a
Number of Documents by Number of Topics matrix of topic proportions.
This requires some linguistic parsing: it is an N_DOCS by N_TOPICS matrix, i.e. it has N_DOCS rows, one per document, and N_TOPICS columns, one per topic. The values are the topic proportions, i.e. if stm_model$theta[1, ] == c(.3, .2, .5), that means Document 1 is 30% Topic 1, 20% Topic 2 and 50% Topic 3.
To find out what topic dominates a document, you have to find the (column!) index of the maximum value, which can be retrieved e.g. by calling apply with MARGIN=1, which basically says "do this row-wise"; which.max simply returns the index of the maximum value:
apply(stm_model$theta, MARGIN=1, FUN=which.max)
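For comparison with the sklearn workflow mentioned in the question, the Python side gives the same kind of matrix: transform returns the documents-by-topics proportions, and the dominant topic per document is the column index of the row-wise maximum (argmax is 0-based, unlike which.max). A minimal sketch with a placeholder corpus:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["cats purr and sleep", "dogs bark and run", "stocks rise and fall"]
document_term_matrix = CountVectorizer().fit_transform(docs)

lda_model = LatentDirichletAllocation(n_components=2, random_state=0)
lda_model.fit(document_term_matrix)

document_topic_distribution = lda_model.transform(document_term_matrix)  # shape: (n_docs, n_topics)
print(document_topic_distribution)                   # each row sums to 1: topic proportions per document
print(document_topic_distribution.argmax(axis=1))    # analogue of apply(stm_model$theta, 1, which.max)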
