Statistics program gives values different from the test sample - Python

I wrote a program for a statistics teaching exercise. Simply put, I was supposed to predict prices for the next 250 days, then extract the lowest and highest price from 10k runs of the 250-day prediction.
I followed the instructions in the problem: use the gauss method from the random module with the mean and std of the given sample.
The highest and lowest prices in the test sample are in the range of 45-55, but I predict 18-88. Is there a problem with my code, or is it just not a good method for prediction?
from random import gauss

with open('AAPL_train.csv', 'r') as sheet:  # we categorize the data here
    Date = []
    Open = []
    High = []
    Low = []
    Close = []
    Adj_Close = []
    Volume = []
    for lines in sheet.readlines()[1:-1]:
        words = lines.strip().split(',')
        Date.append(words[0])
        Open.append(float(words[1]))
        High.append(float(words[2]))
        Low.append(float(words[3]))
        Close.append(float(words[4]))
        Adj_Close.append(float(words[5]))
        Volume.append(int(words[6]))

subtract = []  # find the pattern of price changes by taking the day-to-day differences
for i in range(1, len(Volume)):
    subtract.append(Adj_Close[i] - Adj_Close[i - 1])

mean = sum(subtract) / len(subtract)  # find the mean and std of the change pattern
accum = 0
for amount in subtract:
    accum += (amount - mean) ** 2
var = accum / len(subtract)
stdev = var ** 0.5

worst = []
best = []

def Getwb():  # a function to predict one 250-day path
    index = Adj_Close[-1]
    index_lst = []
    for i in range(250):
        index += gauss(mean, stdev)
        index_lst.append(index)
    worst = min(index_lst)
    best = max(index_lst)
    return worst, best

for i in range(10000):  # try predicting 10000 times, then extract the highest and lowest results
    x, y = Getwb()
    worst.append(x)
    best.append(y)
print(min(worst))
print(max(best))
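For scale, the spread of a 250-step Gaussian random walk can be checked directly. The following is only an illustrative sketch, not part of the assignment code; last_price, mu and sigma are hypothetical stand-ins for the Adj_Close[-1], mean and stdev computed above.

import numpy as np

# hypothetical stand-ins for the values computed above
last_price = 50.0        # Adj_Close[-1]
mu, sigma = 0.05, 1.0    # mean and stdev of the daily changes

# 10,000 simulated 250-day random walks, vectorized
steps = np.random.normal(mu, sigma, size=(10000, 250))
paths = last_price + steps.cumsum(axis=1)
print("simulated extremes:", paths.min(), paths.max())

# the day-250 value alone has standard deviation sqrt(250) * sigma (about 15.8 * sigma),
# so extremes over 10k paths well outside 45-55 are expected from this method
print("one-sigma range of day 250:",
      last_price + 250 * mu - 250 ** 0.5 * sigma,
      last_price + 250 * mu + 250 ** 0.5 * sigma)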

Related

Calculate normalized spectral entropy of time series

I have time series data
c1= c(0.558642328,
0.567173803,
0.572518969,
0.579917556,
0.592155421,
0.600239837,
0.598955071,
0.608857572,
0.615442061,
0.613502347,
0.618076897,
0.626769781,
0.633930194,
0.645518577,
0.66773088,
0.68128165,
0.695552504,
0.6992836,
0.702771866,
0.700840271,
0.684032428,
0.665082645,
0.646948862,
0.621813893,
0.597888613,
0.577744126,
0.555984044,
0.533597678,
0.523645413,
0.522041142,
0.525437844,
0.53053292,
0.543152606,
0.549038792,
0.555300856,
0.563411331,
0.572663951,
0.584438777,
0.589476192,
0.604197562,
0.61670388,
0.624161184,
0.624345171,
0.629342985,
0.630379665,
0.620067096,
0.597480375,
0.576228619,
0.561285031,
0.543921304,
0.530826211,
0.519563568,
0.514228535,
0.515202665,
0.516663855,
0.525673366,
0.543545395,
0.551681638,
0.558951402,
0.566816133,
0.573842585,
0.578611696,
0.589180577,
0.603297615,
0.624550509,
0.641310155,
0.655093217,
0.668385196,
0.671600127,
0.658876967,
0.641041982,
0.605081463,
0.585503519,
0.556173635,
0.527428073,
0.502755737,
0.482510734,
0.453295642,
0.439938772,
0.428757811,
0.422361642,
0.40945864,
0.399504355,
0.412688798,
0.42684828,
0.456935656,
0.48355422,
0.513727218,
0.541630101,
0.559122121,
0.561763656,
0.572532833,
0.576761365,
0.576146233,
0.580199403,
0.584954906)
corresponding to dates
dates = seq(as.Date("2016-09-01"), as.Date("2020-07-30"), by=15)
What I want to do is compute the normalized spectral entropy of this time series. I have found in the literature that a high value indicates high stability of a system.
I have found a function here: https://rdrr.io/cran/ForeCA/man/spectral_entropy.html, but I cannot generate what I want.
I am new to this topic, so any interpretation would be helpful too.
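For reference, one common definition of normalized spectral entropy is easy to compute by hand: normalize the power spectral density to a probability distribution, take its Shannon entropy, and divide by the maximum possible entropy. The Python sketch below illustrates that definition only; it is not necessarily the exact estimator used by ForeCA::spectral_entropy.

import numpy as np
from scipy.signal import periodogram

def normalized_spectral_entropy(x):
    # estimate the power spectral density and drop the zero-frequency term
    _, psd = periodogram(np.asarray(x, dtype=float))
    psd = psd[1:]
    p = psd / psd.sum()    # normalize to a discrete probability distribution
    p = p[p > 0]           # avoid log(0)
    return -np.sum(p * np.log(p)) / np.log(len(psd))

# usage: paste the c1 values from the question into a Python list first
# print(normalized_spectral_entropy(c1))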

Algorithm for efficient portfolio optimization

I'm trying to find the best allocation for a portfolio based on backtesting data. As a general rule, I've divided stocks into large caps and small/mid caps and into growth/value, and I want no more than 80% of my portfolio in large caps or 70% of my portfolio in value. I need an algorithm that will be flexible enough to use for more than two stocks. So far, what I have is (using a simple class called Ticker):
randomBoolean = True
listOfTickers = []
listOfLargeCaps = []
listOfSmallMidCaps = []
largeCapAllocation = 0
listOfValue = []
listOfGrowthBlend = []
valueAllocation = 0

while randomBoolean:
    tickerName = input("What is the name of the ticker?")
    tickerCap = input("What is the cap of the ticker?")
    tickerAllocation = int(input("Around how much do you want to allocate in this ticker?"))
    tickerValue = input("Is this ticker a Value, Growth, or Blend stock?")
    tickerName = Ticker(tickerCap, tickerValue, tickerAllocation, tickerName)
    listOfTickers.append(tickerName)
    closer = input("Type DONE if you are finished. Type ENTER to continue entering tickers")
    if closer == "DONE":
        randomBoolean = False

for ticker in listOfTickers:
    if ticker.cap.lower() == "large":
        listOfLargeCaps.append(ticker)
    else:
        listOfSmallMidCaps.append(ticker)
    if ticker.value.lower() == "value":
        listOfValue.append(ticker)
    else:
        listOfGrowthBlend.append(ticker)

for largeCap in listOfLargeCaps:
    largeCapAllocation += largeCap.allocation
if largeCapAllocation > 80:
    pass  # run a function that will readjust tickers and decrease allocation to large cap stocks

for value in listOfValue:
    valueAllocation += value.allocation
if valueAllocation > 70:
    pass  # run a function that will readjust tickers and decrease allocation to value stocks
The "function" I have so far just iterates through -5 to 6 in a sort of
for i in range (-5,6):
ticker1AllocationPercent + i
ticker2AllocationPercent - i
#update the bestBalance if the new allocation is better
How would I modify this algorithm to work for 3, 4, 5, etc. stocks, and how would I go about changing the allocations for the large/small-mid cap stocks and such?
As mentioned in the above answer, a quadratic solver is typically used for such problems. You can use the quadratic solver available in Pyportfolio. See this link for more details.
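This is not the API of the package mentioned above, just an illustrative sketch with scipy of how the 80% large-cap and 70% value caps become linear constraints on a weight vector, which works for any number of tickers; the expected returns and category flags here are made-up placeholders.

import numpy as np
from scipy.optimize import minimize

# hypothetical example data: one entry per ticker
expected_returns = np.array([0.08, 0.11, 0.06, 0.09])  # e.g. from your backtests
is_large_cap = np.array([1, 0, 1, 0])                  # 1 if the ticker is a large cap
is_value = np.array([1, 1, 0, 0])                      # 1 if the ticker is a value stock
n = len(expected_returns)

# maximize expected return == minimize its negative
objective = lambda w: -expected_returns @ w

constraints = [
    {"type": "eq", "fun": lambda w: w.sum() - 1.0},              # fully invested
    {"type": "ineq", "fun": lambda w: 0.80 - is_large_cap @ w},  # at most 80% in large caps
    {"type": "ineq", "fun": lambda w: 0.70 - is_value @ w},      # at most 70% in value
]
bounds = [(0.0, 1.0)] * n  # long-only weights

result = minimize(objective, x0=np.full(n, 1.0 / n),
                  bounds=bounds, constraints=constraints)
print(result.x)  # allocation per ticker; scales to any number of tickers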

Finding the minimum value of a function with 11,390,625 variable combinations

I am working on code to solve for the optimum combination of diameter sizes for a number of pipelines. The objective function is to find the least sum of pressure drops in six pipelines.
As I have 15 choices of discrete diameter sizes, [2,4,6,8,12,16,20,24,30,36,40,42,50,60,80], that can be used for any of the six pipelines in the system, the number of possible solutions becomes 15^6, which is equal to 11,390,625.
To solve the problem, I am using Mixed-Integer Linear Programming with the Pulp package. I am able to find the solution for combinations of identical diameters (e.g. [2,2,2,2,2,2] or [4,4,4,4,4,4]), but what I need is to go through all combinations (e.g. [2,4,2,2,4,2] or [4,2,4,2,4,2]) to find the minimum. I attempted to do this, but the process takes a very long time to go through all combinations. Is there a faster way to do this?
Note that I cannot calculate the pressure drop for each pipeline in isolation, as the choice of diameter will affect the total pressure drop in the system. Therefore, at any time, I need to calculate the pressure drop of each combination in the system.
I also need to constrain the problem such that the rate / cross-sectional area of a pipeline > 2.
Your help is much appreciated.
The first attempt at my code is the following:
from pulp import *
import random
import itertools
import numpy

rate = 5000
numberOfPipelines = 15

def pressure(diameter):
    diameterList = numpy.tile(diameter, numberOfPipelines)
    pressure = 0.0
    for pipeline in range(numberOfPipelines):
        pressure += rate / diameterList[pipeline]
    return pressure

diameterList = [2, 4, 6, 8, 12, 16, 20, 24, 30, 36, 40, 42, 50, 60, 80]
pipelineIds = range(0, numberOfPipelines)
pipelinePressures = {}
for diameter in diameterList:
    pressures = []
    for pipeline in range(numberOfPipelines):
        pressures.append(pressure(diameter))
    pressureList = dict(zip(pipelineIds, pressures))
    pipelinePressures[diameter] = pressureList
print('pipepressure', pipelinePressures)

prob = LpProblem("Warehouse Allocation", LpMinimize)
use_diameter = LpVariable.dicts("UseDiameter", diameterList, cat=LpBinary)
use_pipeline = LpVariable.dicts("UsePipeline", [(i, j) for i in pipelineIds for j in diameterList], cat=LpBinary)

## Objective function:
prob += lpSum(pipelinePressures[j][i] * use_pipeline[(i, j)] for i in pipelineIds for j in diameterList)

## Each pipeline must be connected to exactly one diameter:
for i in pipelineIds:
    prob += lpSum(use_pipeline[(i, j)] for j in diameterList) == 1

## A diameter is activated if at least one pipeline is assigned to it:
for j in diameterList:
    for i in pipelineIds:
        prob += use_diameter[j] >= lpSum(use_pipeline[(i, j)])

## Run the solution
prob.solve()
print("Status:", LpStatus[prob.status])
for i in diameterList:
    if use_diameter[i].varValue > pressureTest:
        print("Diameter Size", i)
for v in prob.variables():
    print(v.name, "=", v.varValue)
This is what I did for the combination part, which took a really long time:
xList = numpy.array(list(itertools.product(diameterList, repeat=numberOfPipelines)))
print(len(xList))
for combination in xList:
    pressures = []
    for pipeline in range(numberOfPipelines):
        pressures.append(pressure(combination))
    pressureList = dict(zip(pipelineIds, pressures))
    pipelinePressures[tuple(combination)] = pressureList
print('pipelinePressures', pipelinePressures)
I would iterate through all combinations; I think you would otherwise run into memory problems trying to model ALL combinations in a MIP.
If you iterate through the combinations, perhaps using the multiprocessing library to use all cores, it shouldn't take long. Just remember to hold information only on the best combination so far, and not to generate all combinations at once and then evaluate them (see the sketch below).
If the problem gets bigger, you should consider dynamic programming algorithms or use pulp with column generation.
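A minimal sketch of that advice, assuming a hypothetical evaluate() stand-in for the real pressure-drop calculation (the rate/area constraint is omitted here): itertools.product is streamed lazily across a process pool, and only the best combination seen so far is kept.

import itertools
from multiprocessing import Pool

RATE = 5000
DIAMETERS = [2, 4, 6, 8, 12, 16, 20, 24, 30, 36, 40, 42, 50, 60, 80]
N_PIPELINES = 6

def evaluate(combination):
    # hypothetical objective: total pressure drop for this choice of diameters
    total = sum(RATE / d for d in combination)
    return total, combination

if __name__ == "__main__":
    combos = itertools.product(DIAMETERS, repeat=N_PIPELINES)  # lazy, never materialized
    best = None
    with Pool() as pool:
        # chunksize keeps inter-process overhead low; only the running best is stored
        for total, combination in pool.imap_unordered(evaluate, combos, chunksize=10000):
            if best is None or total < best[0]:
                best = (total, combination)
    print("best combination:", best[1], "with pressure drop:", best[0])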

How to plot a document topic distribution in structural topic modeling R-package?

If I am using Python's sklearn for LDA topic modeling, I can use the transform function to get a "document topic distribution" of the LDA results like this:
document_topic_distribution = lda_model.transform(document_term_matrix)
Now I have also tried the R structural topic models (stm) package, and I want to get the same thing. Is there any function in the stm package which can produce the same thing (a document topic distribution)?
I have the stm object created as follows:
stm_model <- stm(documents = out$documents, vocab = out$vocab,
                 K = number_of_topics, data = out$meta,
                 max.em.its = 75, init.type = "Spectral")
But I couldn't find out how to get the desired distribution out of this object. The documentation didn't really help me either.
As emilliman5 pointed out, your stm_model provides access to the underlying parameters of the model, as is shown in the documentation.
Indeed, the theta parameter is a
Number of Documents by Number of Topics matrix of topic proportions.
This requires some linguistic parsing: it is an N_DOCS by N_TOPICS matrix, i.e. it has N_DOCS rows, one per document, and N_TOPICS columns, one per topic. The values are the topic proportions, i.e. if stm_model$theta[1, ] == c(.3, .2, .5), that means Document 1 is 30% Topic 1, 20% Topic 2 and 50% Topic 3.
To find out what topic dominates a document, you have to find the (column!) index of the maximum value, which can be retrieved e.g. by calling apply with MARGIN=1, which basically says "do this row-wise"; which.max simply returns the index of the maximum value:
apply(stm_model$theta, MARGIN=1, FUN=which.max)
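For comparison with the sklearn workflow in the question, the same row-wise argmax on a document-topic matrix looks like this in Python (a toy matrix stands in for the output of lda_model.transform(); note numpy indices are 0-based while R's which.max is 1-based):

import numpy as np

# toy stand-in for lda_model.transform(document_term_matrix): 3 documents, 3 topics
document_topic_distribution = np.array([[0.3, 0.2, 0.5],
                                        [0.6, 0.3, 0.1],
                                        [0.1, 0.1, 0.8]])

# dominant topic per document (row-wise argmax)
print(document_topic_distribution.argmax(axis=1))  # -> [2 0 2]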

How is the Vader 'compound' polarity score calculated in Python NLTK?

I'm using the Vader SentimentAnalyzer to obtain the polarity scores. I used the probability scores for positive/negative/neutral before, but I just realized the "compound" score, ranging from -1 (most negative) to 1 (most positive), would provide a single measure of polarity. I wonder how the "compound" score is computed. Is it calculated from the [pos, neu, neg] vector?
The VADER algorithm outputs sentiment scores for 4 classes of sentiment https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L441:
neg: Negative
neu: Neutral
pos: Positive
compound: Compound (i.e. aggregated score)
Let's walk through the code, the first instance of compound is at https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L421, where it computes:
compound = normalize(sum_s)
The normalize() function is defined as such at https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L107:
def normalize(score, alpha=15):
    """
    Normalize the score to be between -1 and 1 using an alpha that
    approximates the max expected value
    """
    norm_score = score/math.sqrt((score*score) + alpha)
    return norm_score
So there's a hyper-parameter alpha.
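To make that concrete, here is the same normalization reproduced standalone and applied to a few hypothetical raw valence sums (the values are illustrative, not from a real sentence):

import math

def normalize(score, alpha=15):
    # same formula as the vader normalize() quoted above: score / sqrt(score^2 + alpha)
    return score / math.sqrt((score * score) + alpha)

for sum_s in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(sum_s, round(normalize(sum_s), 4))  # maps any real sum_s into (-1, 1)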
As for the sum_s, it is a sum of the sentiment arguments passed to the score_valence() function https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L413
And if we trace back this sentiment argument, we see that it's computed when calling the polarity_scores() function at https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L217:
def polarity_scores(self, text):
    """
    Return a float for sentiment strength based on the input text.
    Positive values are positive valence, negative value are negative
    valence.
    """
    sentitext = SentiText(text)
    #text, words_and_emoticons, is_cap_diff = self.preprocess(text)
    sentiments = []
    words_and_emoticons = sentitext.words_and_emoticons
    for item in words_and_emoticons:
        valence = 0
        i = words_and_emoticons.index(item)
        if (i < len(words_and_emoticons) - 1 and item.lower() == "kind" and \
            words_and_emoticons[i+1].lower() == "of") or \
                item.lower() in BOOSTER_DICT:
            sentiments.append(valence)
            continue
        sentiments = self.sentiment_valence(valence, sentitext, item, i, sentiments)
    sentiments = self._but_check(words_and_emoticons, sentiments)
Looking at the polarity_scores function, what it does is iterate through all the words and emoticons in the SentiText and check each against the rule-based sentiment_valence() function to assign a valence score to the sentiment https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L243; see Section 2.1.1 of http://comp.social.gatech.edu/papers/icwsm14.vader.hutto.pdf
So going back to the compound score, we see that:
the compound score is a normalized score of sum_s and
sum_s is the sum of valence computed based on some heuristics and a sentiment lexicon (aka. Sentiment Intensity) and
the normalized score is simply sum_s divided by the square root of its square plus an alpha parameter that increases the denominator of the normalization function.
Is that calculated from the [pos, neu, neg] vector?
Not really =)
If we take a look at the score_valence function https://github.com/nltk/nltk/blob/develop/nltk/sentiment/vader.py#L411, we see that the compound score is computed with sum_s before the pos, neg and neu scores are computed using _sift_sentiment_scores(), which computes the individual pos, neg and neu scores from the raw scores from sentiment_valence() without the sum.
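To see all four scores side by side, here is a quick usage example with NLTK (the example sentence is arbitrary; the vader_lexicon resource must be downloaded once):

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

# prints a dict with 'neg', 'neu', 'pos' and 'compound' keys
print(sia.polarity_scores("VADER is smart, handsome, and funny!"))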
If we take a look at this alpha mathemagic, the output of the normalization is rather unstable (if left unconstrained), depending on the value of alpha. (The original answer includes plots of the normalization curve for alpha = 0, 15, 50000, and 0.001; it gets funky for negative values such as alpha = -10, -1,000,000, and -1,000,000,000.)
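Those curves are easy to reproduce yourself; a minimal matplotlib sketch follows (the alpha values are chosen only for illustration; for negative alpha the expression under the square root can go negative, which is where the curve breaks down):

import numpy as np
import matplotlib.pyplot as plt

scores = np.linspace(-20, 20, 1000)
for alpha in (0.001, 1, 15, 50000):
    plt.plot(scores, scores / np.sqrt(scores ** 2 + alpha), label="alpha=%s" % alpha)

plt.xlabel("raw score (sum_s)")
plt.ylabel("normalized compound score")
plt.legend()
plt.show()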
"About the Scoring" section at the github repo has a description.
